Apple downloads ~45 TB of models per day from our S3 bucket (twitter.com/julien_c)
553 points by julien_c on Sept 16, 2019 | 225 comments

"Almost everyone" working on NLP uses one of huggingface's pretrained models at one point or another, sooner or later: https://github.com/huggingface/pytorch-transformers

It's so damn convenient, and so nicely done.

And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers

Kudos to Julien Chaumond et al for their work!

For anyone else initially confused, NLP in this context is "Natural Language Processing."

As opposed to?

neuro-linguistic programming, for one

Or nonlinear programming

> Swift Core ML

Why does he call them Swift Core ML? They are Core ML models, usable from both Swift and Objective-C.

That repo is written in Swift, hence it is called Swift Core ML Transformers.

This is probably Apple's continuous integration tests, lazily written to download the whole thing every time someone merges a commit.

That's really stupid. I mean, I would have set up a caching repository (Sonatype Nexus maybe?), downloaded everything there and used that repository. In the tweets the author says they've blocked the download from Apple IPs, so now their pipeline is broken.

I bet that 95% of all in-house CI would break if it didn't have access to the internet. I also bet that 95% of those wouldn't need to have access if they were properly designed.

We rarely hear about CI servers being taken over but it has to happen frequently enough.

Most enterprises use Artifactory or something similar -- or should anyway. Once you start enforcing "no internet for CI" you start to see how poor some ecosystems are. I'm looking at you, Javascript ecosystem packages, with your hardcoded mystery URLs that you sneakily download artefacts from...

We had a situation at work where some developers really got attached to downloading npm modules at container start time to make builds fast. But then starting the containers took 15 minutes :P

Surely lots of orgs would eschew packages with such sneaky behavior? Turning off the network would be a good way to test for that...

You're absolutely right. It's the reason we always use devpi for python mirrors, and firewall off CI.

Ultimately you need to be able to rebuild the product when upstream goes poof.

Fairly sure that was the vector for the big matrix.org breach.

Unfortunately it's very common for companies to set up their CI without any form of caching. I think it's mostly because developers are under time pressure from their managers. In some cases it's because CI is set up by juniors who don't fully understand the tools and the consequences of setting them up at this scale.

It's also not a problem until it is. Then you can devote resources to it, but meanwhile you got things up and running for months/years faster than if you had tried to get everything set up just so from the get-go...

So they can fix it now that it is a problem, and they didn't have to spend time worrying about it before.

Sounds very smart on their part: move forward with what matters (building your codebase, tests, etc) and don't do something (like an internal cache) unless you have to -- the ops equivalent of lazy-loading...

I was asked to scrape some prediction website millions of times every day. Instead of doing that, I reverse engineered their prediction curves from a few hundred data points and served the models myself. Not sure which one is more ethical. Scraping at scale or stealing the models using mathematics. But I know second one is cooler.

Every medium-sized org I’ve worked at has put a caching layer in front of their build dependencies, so builds aren’t blocked when GitHub/PyPI are unavailable. No build/release engineer would leave that trivial door open if they were responsible for the build

Sounds like the build pipeline was set up by a regular dev.

It's smart but pretty fucking rude tbh.

It's not that smart because they might throttle the bandwidth as a result. Also, one day the downloading may take longer than the time between tests and the entire test will start failing. Further, if you want to test the testing procedure, you want a cache anyway.

The data should probably be distributed using e.g. git, so that downloads are incremental by default.

If you host large, publicly available data in a cloud blob service, but you don't have a budget for it, one option is to use the "Requester Pays" feature that Amazon and Google provide. This makes the data available to anyone to download, but they need to pay the download cost themselves.

This is at the tradeoff of making your data significantly more irritating to access, as it's no longer just plugging in a URL into a program, plus everyone who wants your dataset needs to set up a billing account with Amazon or Google.

Or just post a magnet link.

Sure, that's a great option for helping to reduce the cost for well-meaning general use, but the other way makes your costs 100% predictable, which is great if you're on an academic budget (but, again, way more annoying for the downloaders unless they're also using AWS).

So if there's no other seeders, you end up eating the full cost as the only seeder....

torrents were supported by S3 in the past https://docs.aws.amazon.com/AmazonS3/latest/dev/S3Torrent.ht...

That probably wouldn't work here. That s3 bucket is hosting models downloaded at runtime/startup [1] and looks like under normal runs it would be cached. If this is being used in Apple's CI pipeline though the whole thing is being torn down between builds so every build and test has to fetch it again.

[1] https://github.com/huggingface/pytorch-transformers/search?q...
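To make the caching point concrete, here is a minimal sketch of a download cache that survives between CI runs, assuming the CI job can mount a persistent directory. The names `cached_fetch` and `download` and the cache layout are hypothetical and are not how pytorch-transformers does its own caching.

```python
import hashlib
import urllib.request
from pathlib import Path


def download(url: str) -> bytes:
    """Fetch the raw bytes at `url` (the only step that touches the network)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_fetch(url: str, cache_dir: Path, fetch=download) -> Path:
    """Return a local path for `url`, downloading only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Key the entry on a hash of the URL so any URL maps to a safe filename.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        # Write under a temporary name first so an interrupted download
        # never leaves a truncated file under the final name.
        partial = target.with_name(target.name + ".part")
        partial.write_bytes(fetch(url))
        partial.replace(target)
    return target
```

If `cache_dir` lives on a volume that is not torn down between builds, only the first build pays for the download.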

Or use Cloudflare in front of it, especially with mostly static data.

Cloudflare generally doesn’t want to cache data files on non enterprise plans (the terms generally disallow it). Only websites HTML and related content.

I don't have high hopes for his business prospects if this is how he handles one of the richest companies in the world clearly having a high need for something his company offers.

Maybe spend less time on Twitter and more on your business model?

They're basically bragging they have something Apple really wants. Now they have a bunch of people at least interested in what they got. I'll say that's not a bad PR.

Also a great way to throw out massive red flags to any enterprise user that cares about privacy and non-disclosure.

IP address data is pretty sensitive information, and throwing it out there like this, even in aggregate, is not OK because of what it shows.

No matter how much PR this gets, this goes both ways.

Apple publishes that IP range (CIDR address block) in several KB articles on its own website for system administrators to configure firewalls/web filters.



That's rather beside the point. Just because I advertise my IP doesn't mean I want my service providers to disclose my activity publicly.

Maybe I'm misunderstanding something here, but does apple pay them for this?

The point of this thread is that maybe Apple would pay if they were asked.

And how would you be contacting Apple, from your little 3-person startup in Paris? You assume they have the means or contacts to do that; and IMHO the tweet is not that aggressive.

It's been done before, (e.g Intel has been called out for consuming kernel.org bandwidth and git CPU power) and is the simplest way to have people from inside BigCorp get a message.

No need to argue with me. I'm only explaining the original point, not making it myself :-)

But reviewing the thread I see your question was in response to someone calling them "service providers" to Apple. So your question was entirely justified and it was me who'd lost context.

Lots of quid pro quo indeed, no harm done :-)

I want to add an answer to those saying that this gives information about Apple: it only says that somehow, one team inside Apple has set up a CI (badly written script) with maybe 5k tests (from a standard set for instance) and has 9 commits per day. Or maybe they have more commits, and fewer tests? Or maybe it's a matrix of 70x70 tests. Or maybe… well, all it says is that someone is experimenting with this.
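As a rough check on those guesses, the sketch below divides the 45 TB/day from the tweet by a guessed number of test runs. It assumes decimal units (1 TB = 1000 GB) and one download per test run; the function name is made up for illustration.

```python
TOTAL_GB_PER_DAY = 45 * 1000  # 45 TB/day from the tweet, in decimal GB


def gb_per_run(tests: int, commits_per_day: int) -> float:
    """Average download size per test run, assuming one download per run."""
    return TOTAL_GB_PER_DAY / (tests * commits_per_day)


print(gb_per_run(5000, 9))               # 1.0 GB per run
print(f"{gb_per_run(70 * 70, 9):.2f}")   # the 70x70 matrix at 9 commits per day
```

Both guesses land at roughly 1 GB per run, which is why so many different setups fit the same total.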

Blacklist the IP range and throw 404s with "please contact us, we're throttling you because load".
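A minimal sketch of that idea using Python's standard `ipaddress` module. The 17.0.0.0/8 range is the block discussed elsewhere in this thread; the `respond` function and its return shape are hypothetical and stand in for whatever the real server or bucket policy does.

```python
import ipaddress
from typing import Tuple

# Assumption: the 17.0.0.0/8 block mentioned in this thread.
BLOCKED_NETWORKS = [ipaddress.ip_network("17.0.0.0/8")]


def is_blocked(client_ip: str) -> bool:
    """True if the client address falls inside any blocked network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in BLOCKED_NETWORKS)


def respond(client_ip: str) -> Tuple[int, str]:
    """Return (status code, body); blocked ranges get the contact message."""
    if is_blocked(client_ip):
        return 404, "Please contact us, we're throttling you because of load."
    return 200, "OK"
```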

> my service providers

Someone you get freebies from isn't a "service provider".

This isn't sensitive information. Anyone with a BGP session can have this information. [1]

[1] = https://bgp.he.net/AS714#_prefixes

It is public those are Apple’s IPs. It is not public they are downloading data from this S3 bucket, never mind how much.

Pretty amazing that a single company can own an entire block of IP space, if I understand this correctly. Approx how many addresses is this?

I used to work for HP and someone explained to me a select few companies got /8's when the internet was still young. HP got one, Compaq had one which HP now also owns. I was basically told if you had a /8 you didn't give it up because of how valuable and rare they now are (this was around 2010, too). GE, Kodak, Apple, and Microsoft were a few other names that came up in that discussion as well.

GE actually sold their entire block off to AWS a few years ago.

It's a little awkward since a lot of internal software is still configured to whitelist all access from that space since it was a constant for so long.

We called this threxit internally. (Get it? Three dot exit? ;))

And as far as I know we haven’t stopped threxiting — at least they hadn’t when I left. It turns out unwinding IT systems that have had stable IP addresses for 30+ years in a year or two is tricky business.

I wonder what became of the Kodak block? Seems GE would find more use/investment for IP space than a failed film company (IIRC)...

Funny enough the highest price point for IP ranges is somewhere between /16 and /24, IIRC.

You can count how many companies need and will be willing to pay 8-9 figures for a /8 without getting to your toes. And subnetting it and selling it to maximize returns is hard work.

But if you’re sitting on a /21? That’ll move before you can count how many IPs are in the block ;)

Why are they so valuable?

Because there are only four billion addresses all in all. Every device that wants to be reachable on the Internet needs one. These days, mobile devices don't get a public address anymore and there are all sorts of complications due to it.

We're slowly transitioning to a new scheme with ample address space. But to nobody's surprise it's taking decades longer than envisioned.

Hopefully it's surprising to the envisioner(s)!

I assume it was willful optimism on the part of the envisioners. If you tell people it will take decades it will take even longer!

Map: https://www.caida.org/research/id-consumption/census-map/ima...

(a bit dated, but shows the historic allocations)

Look up "CIDR", history thereof, for reasons why it looks this way.

Crazy how wasteful it is. I wonder what genius thought to allocate /8 to every company/organization. You don't need to have PhD in statistics and math to know there's more than 250 companies.

I wonder who will think about the genius who disbanded the EPA, rolled back every environmental protection there is and withdrew from the Paris Agreement at the most critical time for our planet in 40 years.

Hindsight is 20/20 and the „Internet“ was a mainly US centered university research project that was thought of as a toy by the far majority.

Everybody thought they‘d have a replacement for the initial assignment once things got serious... for more fun, google ipv6 history ^^

Wouldn't the more apt comparison be to the folks who set up the EPA in the first place (Nixon administration I believe)?

Maybe take a dose of humility and realize that at one point the fastest processors and memory systems in the world weren't capable of holding more than a limited size routing table, while maintaining acceptable line speed?

And that in the interests of working within the physical hardware limitations of the day, very smart engineers made the best choices they could?


What does this have to do with anything? I'm saying that giving a whole /8 (or I should say class A) to a company is wasteful. And you can only do it a bit over 200 times. You, on the other hand, are saying that the hardware at the time wouldn't have been able to handle all the companies. Why not allocate C blocks, or at the very least B blocks? Or are you saying that they doubted hardware of the future would be capable of handling it?

Because of such wasteful allocation we got this "wonderful" thing called NAT, which basically killed most innovation in the area of networking, and IPv6, which is taking over 20 years to adopt because most ISPs hold on to IPv4 as long as they can, since making the switch requires some work.

Well, same thing for the geniuses that made IPs 32 bit when even MACs are 48 bit.

Or, if my networking trainer at a Cisco course is to be believed, the geniuses that made IPv6 subnets contain 65k hosts at a minimum, when due to ARP requests all traffic would be dead at that scale.

Wow, this is very cool! Thank you.

The post office has/had one

The DOD has like 5 or 6!

well since DoD started the whole Internet thingy that's not that surprising, but yeah they were extra wasteful.

The loopback address is really a class A block that only uses 1 address

You can use anything from that block as well and there are situations where you might want a different loopback address or multiple loopback addresses. /8 of the public address space is massively excessive though.

16.7 million or so.

IPv4 is a 32 bit address space, so it tops out around 4.3 billion addresses total. A /8 locks down the first 8 bits, leaving 2^(32-8) addresses in the variable bits; put another way, there are only 256 possible first octets and this is one of them, so it's 1/256th of those 4.3 billion addresses.

It’s even more when you factor in link local and private network ranges.

Everybody has those for themselves.

That’s what I’m saying. The set of addresses allocated to Apple (in their /8) is an even larger ratio as the private ranges don’t count towards the denominator.

Apple has a /8 of IPv4 space (plus a couple other small allocations: https://whois.arin.net/rest/org/APPLEC-1-Z/nets ), which contains 16777216 addresses: https://en.wikipedia.org/wiki/List_of_assigned_/8_IPv4_addre...

Well that should just be about enough for each customer on Apple iWeb Cloud Services to have their own ip address once it gets started.. :')

It's a /8, which is 2^(32 - 8) = 16.7 million addresses.
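The same arithmetic can be checked with Python's standard `ipaddress` module. This is only a sketch; 17.0.0.0/8 is used as the example because it is the block discussed in this thread.

```python
import ipaddress

block = ipaddress.ip_network("17.0.0.0/8")
total_ipv4 = 2 ** 32  # size of the whole 32-bit address space

print(block.num_addresses)                # 16777216
print(2 ** (32 - block.prefixlen))        # 16777216, same number from the formula
print(total_ipv4 // block.num_addresses)  # 256, i.e. a /8 is 1/256th of IPv4
```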

The entire used to be owned by a company that provided IT services to the Norwegian public sector. They had 4 IPs for every citizen in the country, and change.

Desktop, laptop, phone and tablet. Damn, ran out of IPs for the XBOX.

You could (and should) do that with IPv6. Down with NAT! :)

Network address, gateway, one usable IP address, broadcast address :-)

This map was drawn 13 years ago so it's heavily out of date, but it does illustrate the companies that got their /8 IP blocks back in the day (Ford?!) https://xkcd.com/195/

Ford, Prudential, Halliburton, USPS, Department of Social Security of UK. Someone must have been really high when deciding this.

BTW: this one is better: https://www.caida.org/research/id-consumption/census-map/ima...

Won't every vehicle soon have its own IP address? Many must have them already. IoT means some devices need multiple IP addresses

Ford was assigned its netblock in 1988. It was definitely not assigned with IoT in mind. And if they deployed IoT today, they would just use the mobile telco's dynamic IP addresses and communicate through HTTP.

There are many of them (obviously not very many) - Universities, too iirc. Most have divested by now.

It correlates strongly with big organizations who were "into" networking early on.

There is no expectation of anything without a paid contract. This isn't a major issue though, no enterprise is going to care about their team freely downloading generic public ML data from the internet. If they did care, they would already have other arrangements.

Oh, so now IP addresses are PII? When it is inconvenient for FAAGM monster corporations? I seem to remember a few hundred thousand corporate statements that tracking individual IPs is totally OK and not surveillance.

In the Netherlands an IP address is legally PII. I'm not sure this is true for publicly available IP ranges, especially if they're owned by companies though (not like a person would own a range), but probably not.

IP addresses are definitely PII under GDPR.

There is no concept of PII under GDPR. There is personal data (information about a natural living person) and identifiers (information that connects personal data with an identifiable natural living person). An IP address is usually an identifier - it's not completely unique, but it is potentially enough (especially when combined with other identifiers) to uniquely identify the subject of a piece of personal data.

This tweet is overwhelmingly unlikely to be a breach of GDPR, because the controller (Julien Chaumond) has no ability to correlate an entire /8 range with a natural living person. Nobody is identifiable, there is no personal data, therefore the activity is not in scope.


Do you have a source for that?

My understanding is that it is only PII if it's in conjunction with other data in particular ways.

When the GDPR came along and IP Addresses were being mentioned as PII, my employer required us to sign a document stating (in part) that we wouldn't access, download or communicate PII data except when specifically authorised to do so.

When I refused to sign that and ran it up the flagpole with questions about how I'd do my job, which occasionally included things like blocking IPs and dealing with network captures, it (apparently) went to lawyers who came back with a revised document to clarify that IPs weren't PII unless used in specific ways.

My point being - that specifically showing the 17/8 range as an aggregate shouldn't violate the GDPR any more than mentioning that it's Apple being the source of traffic.

Not hard to find: https://eugdprcompliant.com/personal-data/ states "The conclusion is, all IP addresses should be treated as personal data, in order to be GDPR compliant."

Obviously different jurisdictions have a different notion of PII.

We agree on the conclusion though: a public range for a company probably doesn't count as PII.

Is that page legally binding?

Does it cite anything legally binding?

No? Then it's not a source.

>The GDPR states that IP addresses should be considered personal data as it enters the scope of ‘online identifiers’. Of course, in the case of a dynamic IP address – which is changed every time a person connects to a network – there has been some legitimate debate going on as to whether it can truly lead to the identification of a person or not. The conclusion is that the GDPR does consider it as such.

>.... The conclusion is that the GDPR does consider it as such.


The article just spews word vomit and then makes a conclusion on behalf of the GDPR without even citing a single bit of the GDPR to back up its arguments.

You should never take legal advice from a website that doesn't cite the legal text.

Sources: Article 4(1) and Recital 30.

Either way Apple is not a Person with data rights in this scenario.

You are not going to find anything in the GDPR that clearly states whether IPs are personal data or not, because the GDPR is tech-agnostic and thus doesn't use tech-specific terminology such as IP.

Article 4 [1] states > ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

Can a person be identified by an IP, which is an online identifier? That's up to a judge to decide, according to context. Some judges do think so (French)[2]. The interesting part there: "Les adresses IP collectées [...] doivent par ailleurs être considérées comme une collecte à grande échelle de données d’infraction au sens de l’article 10 du RGPD". Rough translation: "Collected IP addresses [...] must be considered as large scale collection of offenses data in terms of article 10 of the GDPR" (article 10 is about personal data for offenses and convictions).

You can find a few similar cases where IPs are considered as personal data. It has been a subject of discussion here on HN several times during the GDPR introduction.

Please note that I never said or implied that any of what I said or linked is legal advice or a legal text.

[1] https://gdpr.eu/article-4-definitions/ [2] https://www.legalis.net/jurisprudences/tgi-de-paris-ordonnan...

That was the point of my comment.

> Also a great way to throw out massive red flags to any enterprise user that cares about privacy and non-disclosure.

If you want to have a business relationship with someone, pay them.

Big companies are also notorious for reaping whatever they can take from smaller companies... and when it's time for the smaller company to monetize... "whoops we don't have budget for that."

They're not showing IP addresses, they're showing /8 CIDR blocks, that's very coarse granularity.

The real information is that Apple is downloading them. Why would they care about granularity?

If Apple didn't want that information to be public, they probably shouldn't have downloaded 45TB of data a day from a service they have no paid agreement with.

I don’t disagree, but that’s not responsive to my (or GPs) point. GP implied the aggregate is what made it OK: if anything that made it more sensitive (not everyone would bother to aggregate and look up who owns the IPs).

If an enterprise user cares about privacy and non-disclosure they'd have a contract with the guy providing the service surely? If they don't... then they really don't care about privacy and non-disclosure that much.

Apple obviously didn't.

> IP address data is pretty sensitive information, and throwing it out there like this, even in aggregate, is not OK because of what it shows.

Apple's ownership of 17.x is so well known it is described in an xkcd: https://xkcd.com/195/.

You're joking right? Lots of people use twitter. It's not just for tweens anymore. This is as good a way as any to get Apple to take notice and maybe send some bucks their way. It's a half joke/half serious attempt

You don't know that this was the only action he took. This was not a great criticism considering how limited our information is right now.

He literally said he blocked their IP range

The comment said only action. He blocked their IP range but probably contacted them too.

if one of the richest companies in the world is hammering your server without paying for it I would hope they don't get their feelings hurt when the server blocks them until they pay for it.

I mean I've worked at some of the richest companies in the world and I think the conversation would have gone like this

Me: hey project manager our access to server X where we get the really needed X1 resource has been blocked. Probably cause we are hammering their server (which we probably talked about some months ago)

Project manager: hmm a paid account will need to go through approval. Can you tell them we are trying to get a paid account, see if you can get them to give us temporary access, and maybe we can cache the result so we don't hit their server that often.

Me: ok I'll do that.

In fact, if I were at a small company providing a service that a really big company was abusing in such a way that I think they might like to pay for it, I think the smart thing would be to force them to contact me somehow, because how would I even reach the part of the big company I need to talk to?

> Can you tell them we are trying to get a paid account, see if you can get them to give us temporary access, and maybe we can cache the result so we don't hit their server that often. Me: ok I'll do that

I’m not sure I would feel comfortable doing that - stringing along a fellow engineer for the benefit of a freeloading PM.

in the scenario I was envisioning, the big rich company wants to get a paid account now that they need it but to do so they will have to go through a process to get it approved which could take a while.

That tweet will get attention! Finding the right person is everything...

For better or worse, positioning yourself in a David / Goliath fight with Apple is almost always newsworthy. Interviews with the Apple underdog make easy content.

This looks kind of interesting:


When you look further down you find:


And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the s3 resources being pulled. There's probably other stuff like that as well in the code that could be causing equally nasty heartache .. especially if non-persistent containers are involved....

(This is a WAG aka Wild A Guess)

EDIT: Dug a little more and found:


Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime ... Which seems significantly less than ideal ...

It's also possible it's part of a docker build step or similar. Even if they aren't downloading models at run time they may be loading s3 if their pytorch-transformers lib docker cache gets invalidated frequently.

Docker builds are crazy wasteful in terms of bandwidth and compute. Right now I'm struggling with a project that builds ITK on demand every heckin time.

I'm working through how to best integrate apt-cacher-ng, sccache, and a pip cacher. It costs me nothing to hit apt or pypi, but like, somebody is paying that bill. A little perspective goes a long way.

I wonder if I could do something to just proxy all requests and cache those on a whitelist and stick it on my CI network.

Based on this, they might not even realize they are downloading this stuff ...

A brief reminder: Whenever you publish code or documentation that might be used/scraped by the outside world, ALWAYS use a domain you own. If you're on Cloudflare you can instantly (and for free) create Page Rules to use Cloudflare as a CDN, redirect to another CDN, or black-hole or reroute traffic anywhere you want.

When working on APIs meant to be used client-side (especially mobile clients) by different customers or partners, use one subdomain per integrator. If there's a bug in their integration, it could easily DDoS your servers, but DNS is an easy way to have a manual kill switch.

Literally had this happen to us (website, not API) from a misconfigured partner last week - they accidentally misrouted unrelated click traffic through our servers. A 2 minute Page Rule and we not only saved our servers, we protected our partner's brand until they could hotfix. We could have done this with a rule looking at the path, but not easily something looking at obscure auth keys. Segmented traffic is happy traffic.

Not to mention that if Cloudflare CDN was in front of it this traffic would be free.

Not with this amount of traffic it won't be free. 45 TB per day lol Cloudflare will be disabling your account and in contact for payment in a hurry.

Go ahead and try, see how far their "free" tier really goes.

For comparison, that's just about half the amount cdnjs delivers every day (on average) https://github.com/cdnjs/cf-stats/blob/master/2019/cdnjs_Aug...

Just waiting for the Cloudflare CEO who lurks around to pop in here and offer it for free.

Not sure about that. Especially if it all goes to one ip address where they have peering arrangements. It could cause some load there but the traffic will be essentially free for cloudflare. And some good publicity for Cloudflare.

I'm skeptical of the number of 500+ MB files the CloudFlare CDN would actually cache...

Does anyone have any numbers on this?

Correct; Cloudflare doesn't cache large asset files (I think anything more than 2MB?) by default. It's not that kind of CDN... at least, not for free it's not.

Of course, you can trick Cloudflare into caching your large media assets using some funky Page Rules... but I wouldn't suggest it. Mostly just for moral reasons. If you have that much traffic, you should be making some money off it and then paying Cloudflare with it!

512mb: https://support.cloudflare.com/hc/en-us/articles/200172516-U...

And a "cache everything" page rule tends to cache literally any file type, but it's not a great idea to push media files through CF due to the TOS prohibiting "disproportionate amounts of non-web content".

2mb might be referencing the limit for Workers KV.

Ah, much better than 2MB. An image for a high resolution display can hit that easily. Less so 512MB.

Good point. I guess for free plans they only cache up to 512mb files [1]. It does seem you can set page rules to cache large files by extension [2].

1. https://support.cloudflare.com/hc/en-us/articles/200172516 2. https://support.cloudflare.com/hc/en-us/articles/11500015027...

huh? why publish if you don’t want it used?

Well, you could contact them and make a very-likely-to-succeed case that they should pay you some money, or you could complain about it on Twitter.

Twitter will probably be a faster response than automated email inboxes at Apple

The executive team email addresses that can be easily found are monitored quite well.

That's about $4000/month in bandwidth costs, assuming retail pricing.

FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.

> That's about $4000/month in bandwidth costs

You're an order of magnitude off.

45 TB per day is 1,350 TB in a month, or 1,350,000 GB.

Show me somewhere you can get a petabyte of egress inside a calendar month for 4 figures USD...

Let's suppose you even used the cheaper egress from Cloudfront rather than serving from S3 (lol @ your wallet if you serve 1 PB doing that).


The first petabyte costs an average of $0.045 per GB - $45,000.

The remaining 350 TB costs another $10,500 for a total of $55,500.

Serving off S3 directly? Yeah, that'll be more like $150,000.
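For anyone who wants to redo that arithmetic, here is a small sketch of the tiered calculation. The $0.045/GB rate for the first petabyte is the figure quoted above; the $0.030/GB rate for the remainder is back-derived from the $10,500 for 350 TB and is an assumption, not a quoted price list. Decimal units (1 TB = 1000 GB) and a 30-day month are assumed.

```python
GB_PER_TB = 1000
monthly_gb = 45 * 30 * GB_PER_TB  # 45 TB/day over a 30-day month

# (tier size in GB, price per GB); None means "everything that is left".
# Rates come from the comment above, not from an official price list.
TIERS = [(1_000_000, 0.045), (None, 0.030)]


def tiered_cost(total_gb, tiers):
    """Sum the cost of `total_gb`, filling each pricing tier in order."""
    remaining = total_gb
    cost = 0.0
    for size, price_per_gb in tiers:
        portion = remaining if size is None else min(remaining, size)
        cost += portion * price_per_gb
        remaining -= portion
        if remaining <= 0:
            break
    return cost


print(f"{monthly_gb:,} GB per month")
print(f"${tiered_cost(monthly_gb, TIERS):,.0f} per month")  # $55,500
```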

> Show me somewhere you can get a petabyte of egress inside a calendar month for 4 figures USD

Correct me if I'm wrong, but most colocation/dedicated server providers offer such prices. E.g. hetzner.com @ €1/TB, sprintdatacenter.pl @ €0.91/TB, dedicated.com @ $2/TB (or $600/month for an unmetered 1 Gbps connection).

But if you want S3/CDNs/<insert any cloud offering here>, then yeah, they're expensive.

BTW per Cloudflare ToS[0]:

> Use of the Service for the storage or caching of video (unless purchased separately as a Paid Service) or a disproportionate percentage of pictures, audio files, or other non-HTML content, is prohibited.

[0] https://www.cloudflare.com/terms/

You confused cloudfront with cloudflare

True, I'm not that familiar with Amazon offerings (only ever used it for Windows GPU instances) and reading too fast got me. Too late for an edit.

Tweet does say it's a 'bucket' though.

Cloudfront is a premium service. It's not a good baseline for bandwidth prices.

Backblaze will send out data for 1 cent per GB, or $13,500. BunnyCDN apparently starts at 1 cent per GB in North America and Europe, and has a fewer-PoP plan that would be $5,550. And let's be fair, you don't need many points of presence for half-gig neural network models.

The other option is buying transit. Let's say we want the ability to deliver 80% of the 45TB in 4 hours as our peak rate. Then we need 20gbps of bandwidth. That's around $9,000 in a big US datacenter. Aim for a peak of 6/8 hours and it's $6,000/$4,500.
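The transit numbers can be reproduced with a short sketch. The $450 per Gbps per month is not a quoted price; it is back-derived from the $9,000 for 20 Gbps above and is an assumption, as are the function names.

```python
PRICE_PER_GBPS = 450  # assumption: $9,000 / 20 Gbps from the comment above


def required_gbps(tb_per_day: float, fraction: float, hours: float) -> float:
    """Link speed needed to move `fraction` of the daily volume in `hours`."""
    bits = tb_per_day * fraction * 8e12  # 1 TB = 8e12 bits (decimal units)
    return bits / (hours * 3600) / 1e9


for hours in (4, 6, 8):
    gbps = required_gbps(45, 0.8, hours)
    print(f"{hours}h peak window: {gbps:.1f} Gbps, about ${gbps * PRICE_PER_GBPS:,.0f}/month")
```

With these assumptions the three windows come out at roughly 20, 13.3 and 10 Gbps, which matches the $9,000 / $6,000 / $4,500 figures.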

How the hell are they paying that bill then??

ingress/egress is usually free within the region AFAIR

Only if both ends of the request are on AWS.

I may be a little off, but definitely not by an order of magnitude. Even high-end hosting at a provider like Akamai can be under $10/TB-mo for such large volumes. And since you don't need any of the hundred value-added services that Akamai, Cloudfront, Fastly etc. offer, the budget providers can go a lot lower than that.

Could be a lucrative way of eliminating that problem.

If you don't want someone else to do something that costs you money, you're going to have a bad time if you don't prevent them from doing it.

Amazon has the ability to charge the requester. Pass the buck on.

I don't think this works on public files though, does it?

This is the main purpose of the oft-maligned allauthenticatedusers permission in the S3 console: anyone can read it, they just need an identity to pay.

You can allow anyone to read it and pay as long as they authenticate with AWS so aws can bill them.

Obviously not.

What does hugging face do? Do they implement models from papers and make them available for free?

Yes. I don’t know if that’s all that they do. They often port new Tensorflow models to PyTorch as well. They provide straightforward APIs, nice documentation, and clear tutorials. I use their stuff pretty regularly.

Do you know their business model? Looks like they are an open source company. How do they earn money?

They are a startup. 1 year ago their future product was chatbots / text platforms for enterprise. Don't know if they pivoted, but I don't think so.

My guess is that they don't earn money yet. They more or less are just out of school so it's a young startup. Actually studied with some of them and I'm still a student.

Their business model is someone like Apple buying them out.

I'm guessing someone at apple internally distributed a dockerfile that pulls that down.

In a few weeks he can just point Apple's IP range to a shared iCloud folder.

That'll get them Apple's attention pretty quickly, I'd assume.

Now that’d be a bit ironic to get a "your account is using excessive data bandwidth" and be able to respond with "incorrect, your IPs are using too much bandwidth".

Paid by requester is the feature they are looking for.


That's a very neat feature I never knew existed. I only suspect this will hurt their larger mission of helping many smaller teams and individuals use the models.

Not really, if you are a small user your cost is negligible. You can calculate how much it would be for a small team, but my guess is a couple of dollars per month. Apple's use case is still very reasonable, and the cost for them is not that bad either. They could also split out different customers to different buckets and have the big guys pay for it while smaller companies have it for free. There are many options.

Yeah that makes sense. I was thinking in terms of ease of access. If a large organisation makes everyone pay, that means everyone has to arrange accounts and payment methods.

By splitting it out you have no guarantee that the big guys won't just use the free one.

I read the Twitter post but did not understand what is happening. Can someone please explain.

Apple employees are using their product, downloading lots of data, not paying for any of it, and the OP doesn't like it or can't afford it.

I don't think it's employees as such — even Apple does not have THAT many machine learning people, and they wouldn't download models daily.

Maybe a server farm, where each instance downloads a model when spinning up?

In the last days of my time spent in the XML salt mines, I got in on a conversation with the web masters at w3.org.

You would not believe how many people and how many libraries pull from primary sources directly instead of using local copies of common resources. I found this conversation because I'd just finished fixing that in our code and taking about 5 minutes off the build process.

Let me restate that: We were spending 5 minutes just downloading schema files. In an automated build. Every time, sometimes on several machines at once.

At one point we were trying to convince him that intentionally slowing all requests down by say 500 ms would get the attention of people who were misbehaving. Anyone who was downloading it once and caching would hardly notice the 500 ms. Those running it once per task would be forced to figure their shit out.

I sometimes wonder about all the additional HTTP load to places like debian.org that must come from Docker builds; I don't have any kind of caching in mine, and every single commit causes CI to go and build images and run tests on them.

It's not an issue I really see, but it seems to me that CI infrastructure (Travis, ...) really should have caching proxies in place.

Several places I've worked have pushed Artifactory on us for various reasons from security to resilience to network outages to bandwidth saving.

It's not the worst fate.

It sounds like a misconfigured build pipeline which re-downloads the model for every single build.

Is that normally freely available? Or is Apple accessing data that is publicly accessible but not meant to be shared?

Isn’t this a use case where BitTorrent would shine?

The problem here is "Someone's CI pipelines redownload the same models on every build"

I'd say there's only a 10% chance Apple's firewall would let BitTorrent through, and only a 3% chance the CI servers would maintain a positive seed ratio.

Possibly it might solve the problem because users would cache the resources themselves to avoid the hassle of getting BitTorrent into their CI pipeline...

Not magically? Torrents need seeders. If you are the only seeder, then you will still get the full bill from AWS all the same.

If your CI/CD is re-downloading and re-building everything on every single run, you are not only being wasteful, you're actually more likely to have an outage due to not storing dependency artifacts needed for deploy. Use a local artifact store to be more resilient to failures of servers you don't control (and also save everyone money and time).
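
As a minimal sketch of the "download once, then reuse" idea, assuming a plain HTTP(S) download and a cache directory that persists between builds: the function names and the cache location here are made up for illustration, and a real setup would more likely put a proper artifact store or caching proxy in front of CI.

```python
import hashlib
import urllib.request
from pathlib import Path


def _fetch(url):
    """Download the full body of `url` (the only place that touches the network)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def cached_download(url, cache_dir=None, fetch=_fetch):
    """Return a local path for `url`, downloading only on a cache miss."""
    if cache_dir is None:
        # Hypothetical default; point this at whatever survives between builds.
        cache_dir = Path.home() / ".cache" / "model-cache"
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)

    # Key the file on a hash of the URL so any URL maps to a safe file name.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        tmp = target.with_suffix(".tmp")
        tmp.write_bytes(fetch(url))
        tmp.replace(target)  # rename into place so a crash leaves no partial entry
    return target
```

The second and later calls for the same URL never hit the network, which is the whole point: the upstream bucket sees one download per cache, not one per build.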

Charge, them, money?

Hosting terabytes of data on an S3 bucket where people would download 45TB per month ($0.023/GB == $1000+/month) sounds like a really expensive way to distribute your data to people...

It is 45 TB downloaded PER DAY. You should use the egress data price for the calculation instead of the hosting price.

The graph makes it look like about 90% of the requests are Apple, so I think this is intended to be serving more like 4.5TB per day. At that rate, I don't think it's a terrible way to handle it, since S3 handles all of the HA and scaling stuff internally. They clearly didn't expect Apple to go whole hog on the service.

You could configure buyer pays, if you wanted to.

That makes it harder for everyone, though. Companies like Apple should proxy and/or cache those requests in their own internal mirror rather than hitting that S3 bucket every time. Requiring requester payment would mean it would only really be used by corporations, whereas the author clearly wants a service open to anyone without having to open an AWS account to pay.

Make the bucket buyer-pays, but offer Torrent links as well. Businesses doing CI will pay for S3 usage so they don't have to deal with torrents, end-users will get free torrent access, everybody wins.

I just skimmed AWS requester pays and it seems that you can't set the price; the requester always pays only AWS's download cost. So why bother setting it as requester pays at all if it doesn't bring you anything?

It brings you the absence of GET requests and the bandwidth charges on your AWS bill. In this case, 45TB a day is going to add a lot to an AWS bill, you can shift that cost to the user that's downloading the files from you.
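
For reference, a sketch of what flipping a bucket to Requester Pays looks like with boto3. It assumes boto3 is installed and AWS credentials are configured; the bucket, key and destination arguments are placeholders, not the real ones from the tweet.

```python
# Sketch: turn on Requester Pays for a bucket, and read from such a bucket.
# Assumes boto3 and configured AWS credentials; all names are placeholders.


def requester_pays_config():
    """Payload for S3's bucket request-payment setting."""
    return {"Payer": "Requester"}


def enable_requester_pays(bucket):
    """Bucket owner side: make downloaders pay request and transfer charges."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    s3.put_bucket_request_payment(
        Bucket=bucket,
        RequestPaymentConfiguration=requester_pays_config(),
    )


def download_as_requester(bucket, key, dest):
    """Downloader side: explicitly accept the charges for this request."""
    import boto3

    s3 = boto3.client("s3")
    # Without RequestPayer, S3 rejects reads on a Requester Pays bucket.
    s3.download_file(bucket, key, dest, ExtraArgs={"RequestPayer": "requester"})
```

Note that Requester Pays buckets do not accept anonymous requests, so every downloader needs AWS credentials. That is the accessibility trade-off discussed above.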

The S3 bucket is a repository for the models which get downloaded at runtime it seems so a torrent isn't really a good fit for a drop in replacement.

What’s the backstory on this? (Is it something I should already know)

Isn't it likely that someone wrote a script for testing some regression and it keeps running in a loop? I can almost bet that will be the case.

Is this what they call product market fit?

Looks like an open source company. What's their business model? Does anyone know how they earn money?

This is what success looks like if you charge money for a good or service.

And now the twitter post is gone? I'm guessing the west coast woke up and someone at Apple said "Wait, you could infer some proprietary information with that information ..."

I can see the Twitter post without issues.

Yeah, as a follow-up: after clearing some caches it seems to work for me as well. Interesting.

Are those full downloads or just HEAD or range requests from some CI?

They're the ones that made Amazon get those Data Trucks lol

Apple are probably doing "continuous integration" where all assets are re-downloaded from the Internet in each iteration. Tip: put your stuff on Github :P

With models that large you would be paying for GitHub's LFS credits. Those aren't cheap as I recall. Napkin math at $5/50GB of bandwidth per month for 45TB per day it would cost them $135k/mo to use Github. That's over 2x more than the S3 egress charges would be.
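
The napkin math above, spelled out. The $5 per 50 GB data pack is the figure from the comment (as recalled by the commenter, so not a verified price); a 30-day month and decimal units are assumed.

```python
import math

PACK_GB = 50   # bandwidth per data pack, as quoted in the comment
PACK_USD = 5   # monthly price per data pack, as quoted in the comment


def lfs_monthly_usd(tb_per_day, days=30):
    """Monthly cost if all egress is paid for in whole data packs."""
    gb = tb_per_day * days * 1_000          # decimal TB -> GB
    return math.ceil(gb / PACK_GB) * PACK_USD


if __name__ == "__main__":
    print(f"${lfs_monthly_usd(45):,}/month")
```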

Aren’t all the authors of that paper from Apple?

Why can't they use cloudfront?

cloudfront costs money too.

For some reason I thought the prices for transfer were much better than S3, but that doesn't look like it's the case. It's slightly better, at large volumes.

It's surprising that nobody here's mentioned Wasabi, since they have free egress.

There's no such thing as a free lunch.

Wasabi's own docs[1] mention a ballpark definition of "reasonable" as monthly transfer being less than total storage. Above that you'll get a call.

I'm sure they'd still be cheaper than AWS, but it's not going to be $5/TB/month to service the entire world.

[1]: https://wasabi.com/pricing/pricing-faqs/

So these jokers at apple publish a paper by using code from huggingface.

Need to distribute large static content? Looks like a good job for a torrent.

or ipfs using ipns ... so many good solutions, but people suggest bucket where the downloader is paying... or cloudfear /sic

Have you considered Cloudflare?

Nobody is going to regularly let you have 45 terabytes of data for free. It's swamping them, especially relative to other users.

> Nobody is going to regularly let you have 45 terabytes of data for free.

You can still use a few dedicated servers on Hetzner though. They'll easily let you serve 200TB / month from each server for $100. It's obviously not comparable to a proper CDN, but it does work for many companies.

Backblaze B2 + Cloudflare = free outbound transfer (bandwidth alliance).

Making objects in S3 publicly available and then complaining about their extortionate bandwidth charges is...silly.

If you try to serve a petabyte of binary data a month through Cloudflare, they're going to shut you down.

S3 is very expensive, but averaging 4 gigabits 24x7 is not going to be cheap anywhere.

>averaging 4 gigabits 24x7 is not going to be cheap anywhere

A normal upstream (if you were in a colo or some sort of DC) will charge you about $0.50 per meg. So, 4gbps would cost you... $200/mo. [1]

I think people don't realize just how much more expensive IaaS providers are. We're literally talking orders of magnitude.

[1] https://blog.telegeography.com/outlook-for-ip-transit-prices...

$2000, it seems like you dropped a zero. And you're presumably going to need some burst capacity, so call it $4k for a 2:1 ratio if you care about not throttling at peak times.
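
A quick sanity check of the corrected figure. The $0.50 per Mbps price and the 2:1 burst ratio are the numbers from the two comments above; decimal units are assumed.

```python
def average_gbps(tb_per_day):
    """Average rate in Gbps if the daily volume were spread evenly over 24 h."""
    return tb_per_day * 1e12 * 8 / 86_400 / 1e9


def transit_usd(gbps, usd_per_mbps=0.50, burst_ratio=1.0):
    """Monthly cost of a commit sized at `burst_ratio` times the given rate."""
    return gbps * 1_000 * usd_per_mbps * burst_ratio


if __name__ == "__main__":
    print(f"45 TB/day averages {average_gbps(45):.2f} Gbps")
    print(f"4 Gbps commit:       ${transit_usd(4):,.0f}/month")
    print(f"with 2:1 burst room: ${transit_usd(4, burst_ratio=2):,.0f}/month")
```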

The big cloud providers are very expensive, but you're getting impeccable bandwidth. Low packet loss, burst what you want even in peak times, fast transit worldwide, minimal variability.

What most people actually don't realize is that they're making that tradeoff implicitly. You're not being ripped off if you need all of these things, but if it's overkill you can get much cheaper bandwidth elsewhere.

Backblaze B2 + Cloudflare would be a perfect combination for hosting a static site. Unfortunately there's no way to map a Backblaze bucket to a domain, so even if you use cloudflare to point www.mydomain.com to it your files still show up at www.mydomain.com/path/to/bucket.

But it certainly works if you need to CDN a bunch of large files.

You can use Cloudflare Workers to rewrite the path.

Those get fairly expensive if you have a lot of requests.

Page rules are free if you only need a few, though.

3 page rules are free. You need at least 2 page rules per website you want to use with Backblaze and Cloudflare, and that's for bare minimum workability.

For a commercial company paying $5-10 a month to save TBs of bandwidth via Backblaze & Cloudflare is a no brainer. It's just unfortunately not very workable for small projects. Github Pages works with Cloudflare for static sites just fine though.

I had intended to look into Cloudflare before but didn't get a chance. Looks like it's a good time now!

Cloudflare will cut you off and require a paid plan before you get to 1/10th this amount of egress.

Still cheaper than AWS!

If a company the size of Apple finds this that useful, perhaps you should consider charging for your service, rather than just complaining on Twitter about the free usage you appear to have willingly given away?

Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?

Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)

On the flip side, huge corporations like Apple should be more careful about not abusing free services set up by other people. It's a dick thing to drown free services in traffic just because they're free, and it's also a pretty big security issue to download (seemingly at runtime, given how much they're downloading each day) from a random bucket you don't control.

"It's your own fault if you didn't foresee a trillion dollar company exploiting the product you made free and open source to help researchers and now are incurring 15K$ bills per month"

And then people wonder why people don't want to make their stuff free/open source. Even when it's free people still think you're somehow entitled and overcharging.

I mean... yes? It doesn't have to be a trillion dollar company. You put tons of useful data on a public S3 bucket and publicize it, and people are going to download it. S3 data transfer isn't free, so I think it's reasonable to expect that, over time, the cost of serving the data is going to be prohibitive without funding it in some way.

> Even when it's free people still think you're somehow entitled and overcharging.

I just explicitly advocated for the opposite of that, so I'm not sure where you're getting that.
