It's so damn convenient, and so nicely done.
And they keep doing neat things like this one: https://github.com/huggingface/swift-coreml-transformers
Kudos to Julien Chaumond et al for their work!
Why does he call them Swift Core ML? They are Core ML models, usable in Swift and Objective-C.
We rarely hear about CI servers being taken over but it has to happen frequently enough.
Ultimately you need to be able to rebuild the product when upstream goes poof.
Sounds very smart on their part: move forward with what matters (building your codebase, tests, etc.) and don't do something (like an internal cache) unless you have to -- the ops equivalent of lazy loading...
Sounds like the build pipeline was set up by a regular dev.
The data should probably be distributed using e.g. git, so that downloads are incremental by default.
This comes at the tradeoff of making your data significantly more irritating to access, since it's no longer just a matter of plugging a URL into a program, plus everyone who wants your dataset needs to set up a billing account with Amazon or Google.
Maybe spend less time on Twitter and more on your business model?
IP address data is pretty sensitive information, and throwing it out there like this, even in aggregate, is not OK because of what it shows.
No matter how much PR this gets, this goes both ways.
It's been done before (e.g. Intel has been called out for consuming kernel.org bandwidth and git CPU power), and it's the simplest way to get a message to people inside BigCorp.
But reviewing the thread I see your question was in response to someone calling them "service providers" to Apple. So your question was entirely justified and it was me who'd lost context.
I want to add, for those saying that this gives information about Apple: it only says that somehow, one team inside Apple has set up a CI (with a badly written script), with maybe 5k tests (from a standard set, for instance) and 9 commits per day. Or maybe they have more commits and fewer tests? Or maybe it's a matrix of 70x70 tests. Or maybe… well, all it says is that someone is experimenting with this.
Someone you get freebies from isn't a "service provider".
https://bgp.he.net/AS714#_prefixes
It's a little awkward since a lot of internal software is still configured to whitelist all access from that space since it was a constant for so long.
And as far as I know we haven’t stopped threxiting — at least they hadn’t when I left. It turns out unwinding IT systems that have had stable IP addresses for 30+ years in a year or two is tricky business.
You can count how many companies need and will be willing to pay 8-9 figures for a /8 without getting to your toes. And subnetting it and selling it to maximize returns is hard work.
But if you’re sitting on a /21? That’ll move before you can count how many IPs are in the block ;)
We're slowly transitioning to a new scheme with ample address space. But to nobody's surprise it's taking decades longer than envisioned.
(a bit dated, but shows the historic allocations)
Look up "CIDR", history thereof, for reasons why it looks this way.
Hindsight is 20/20, and the "Internet" was a mainly US-centered university research project that was thought of as a toy by the vast majority.
Everybody thought they'd have a replacement for the initial assignment once things got serious... for more fun, google ipv6 history ^^
And that in the interests of working within the physical hardware limitations of the day, very smart engineers made the best choices they could?
Because of such wasteful allocation we got this "wonderful" thing called NAT, which basically killed most innovation in the area of networking, and IPv6, whose adoption is taking over 20 years because most ISPs hold on to IPv4 as long as they can, since making the switch requires some work.
Or, if my networking trainer at a Cisco course is to be believed, the geniuses who decided IPv6 subnets should contain 65k hosts at a minimum, when due to ARP requests all traffic would be dead at that scale.
IPv4 is a 32 bit address space, so it tops out around 4.2 billion total.
17.0.0.0/8 locks down the first 8 bits, leaving 2^(32-8) variable bits; put another way, there are only 256 possible first octets and this is one of them, so it's 1/256th of 4.2 billion addresses.
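A quick check of that arithmetic with the standard library, for anyone who wants to play with it:

```python
import ipaddress

# Apple's block as discussed above; a /8 fixes the first octet and leaves 24 variable bits.
block = ipaddress.ip_network("17.0.0.0/8")
print(block.num_addresses)          # 16777216, i.e. 2**24 addresses
print(block.num_addresses / 2**32)  # 0.00390625, exactly 1/256 of the IPv4 space
```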
BTW: this one is better: https://www.caida.org/research/id-consumption/census-map/ima...
It correlates strongly with big organizations who were "into" networking early on.
This tweet is overwhelmingly unlikely to be a breach of GDPR, because the controller (Julien Chaumond) has no ability to correlate an entire /8 range with a natural living person. Nobody is identifiable, there is no personal data, therefore the activity is not in scope.
My understanding is that it is only PII if it's in conjunction with other data in particular ways.
When the GDPR came along and IP Addresses were being mentioned as PII, my employer required us to sign a document stating (in part) that we wouldn't access, download or communicate PII data except when specifically authorised to do so.
When I refused to sign that and ran it up the flagpole with questions about how I'd do my job, which occasionally included things like blocking IPs and dealing with network captures, it (apparently) went to lawyers, who came back with a revised document clarifying that IPs weren't PII unless used in specific ways.
My point being: specifically showing the 17/8 range as an aggregate shouldn't violate the GDPR any more than mentioning that Apple is the source of the traffic.
Obviously different jurisdictions have different notions of PII.
We agree on the conclusion though: a public range for a company probably doesn't count as PII.
Does it cite anything legally binding?
No? Then it's not a source.
>The GDPR states that IP addresses should be considered personal data as it enters the scope of ‘online identifiers’. Of course, in the case of a dynamic IP address – which is changed every time a person connects to a network – there has been some legitimate debate going on as to whether it can truly lead to the identification of a person or not. The conclusion is that the GDPR does consider it as such.
>.... The conclusion is that the GDPR does consider it as such.
The article just spews word vomit and then makes a conclusion on behalf of the GDPR without even citing a single bit of the GDPR to back up its arguments.
You should never take legal advice from a website that doesn't cite the legal text.
Either way Apple is not a Person with data rights in this scenario.
Article 4 states:
> ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
Can a person be identified by an IP, which is an online identifier? That's up to a judge to decide, according to context. Some judges do think so (this one is French). The interesting part there: "Les adresses IP collectées [...] doivent par ailleurs être considérées comme une collecte à grande échelle de données d’infraction au sens de l’article 10 du RGPD". Rough translation: "Collected IP addresses [...] must moreover be considered a large-scale collection of offense data within the meaning of Article 10 of the GDPR" (Article 10 is about personal data relating to offenses and convictions).
You can find a few similar cases where IPs are considered as personal data. It has been a subject of discussion here on HN several times during the GDPR introduction.
Please note that I never said or implied that any of what I said or linked is legal advice or a legal text.
If you want to have a business relationship with someone, pay them.
Apple obviously didn't.
Apple's ownership of 17.x is so well known it is described in an xkcd: https://xkcd.com/195/.
I mean, I've worked at some of the richest companies in the world, and I think the conversation would have gone like this:
Me: hey project manager our access to server X where we get the really needed X1 resource has been blocked. Probably cause we are hammering their server (which we probably talked about some months ago)
Project manager: hmm a paid account will need to go through approval. Can you tell them we are trying to get a paid account, see if you can get them to give us temporary access, and maybe we can cache the result so we don't hit their server that often.
Me: ok I'll do that.
In fact, if I were at a small company providing a service that a really big company was abusing in a way I thought they might like to pay for, the smart thing would be to force them to contact me somehow, because how do I even call the part of the big company I need to talk to?
I’m not sure I would feel comfortable doing that - stringing along a fellow engineer for the benefit of a freeloading PM.
When you look further down you find:
And that's just a quick search for s3 in the repo. It would not surprise me in the least to discover a `from_pretrained` that points at one of the s3 resources being pulled. There's probably other stuff like that in the code that could be causing equally nasty heartache... especially if non-persistent containers are involved.
(This is a WAG, aka Wild-Ass Guess)
EDIT: Dug a little more and found:
Unless I'm mistaken here, there's a crap ton of code that could be downloading models at runtime... which seems significantly less than ideal.
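Not a description of their actual setup, just a hedged sketch of how a CI job could avoid those runtime downloads: fetch and save the model once (e.g. when building the image), then load it from disk. The model name and cache path below are illustrative, and the exact behavior can vary across transformers versions:

```python
from transformers import AutoModel, AutoTokenizer

LOCAL_DIR = "/opt/model-cache/bert-base-uncased"  # pre-populated once, outside the hot CI path

# One-time population step (run when building the CI image, not per test run):
#   AutoModel.from_pretrained("bert-base-uncased").save_pretrained(LOCAL_DIR)
#   AutoTokenizer.from_pretrained("bert-base-uncased").save_pretrained(LOCAL_DIR)

# Per-run loading hits only the local disk, so no request ever leaves the build network.
model = AutoModel.from_pretrained(LOCAL_DIR)
tokenizer = AutoTokenizer.from_pretrained(LOCAL_DIR)
```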
I'm working through how to best integrate apt-cacher-ng, sccache, and a pip cacher. It costs me nothing to hit apt or pypi, but like, somebody is paying that bill. A little perspective goes a long way.
I wonder if I could just proxy all requests, cache the ones on a whitelist, and stick that on my CI network.
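A rough sketch of that whitelist-plus-cache idea, assuming plain-HTTP upstreams (HTTPS would need CONNECT/TLS handling that this omits); the hosts, port, and cache path are illustrative:

```python
import hashlib
import os
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

CACHE_DIR = "/var/cache/ci-proxy"                                  # assumed writable cache location
ALLOWED_HOSTS = {"archive.ubuntu.com", "files.pythonhosted.org"}   # example whitelist

class CachingProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        url = self.path                               # proxy clients send an absolute URL here
        host = urllib.parse.urlparse(url).hostname or ""
        if host not in ALLOWED_HOSTS:
            self.send_error(403, "host not whitelisted")
            return
        cache_path = os.path.join(CACHE_DIR, hashlib.sha256(url.encode()).hexdigest())
        if not os.path.exists(cache_path):            # cache miss: fetch from upstream exactly once
            os.makedirs(CACHE_DIR, exist_ok=True)
            with urllib.request.urlopen(url) as resp, open(cache_path, "wb") as out:
                out.write(resp.read())
        with open(cache_path, "rb") as f:             # cache hit (or freshly filled): serve locally
            body = f.read()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 3128), CachingProxy).serve_forever()
```

Point the CI boxes at it with the usual http_proxy setting (or the tool-specific equivalent) and only the first machine pays for any given download.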
Go ahead and try, see how far their "free" tier really goes.
Does anyone have any numbers on this?
Of course, you can trick Cloudflare into caching your large media assets using some funky Page Rules... but I wouldn't suggest it. Mostly just for moral reasons. If you have that much traffic, you should be making some money off it and then paying Cloudflare with it!
And a "cache everything" page rule tends to cache literally any file type, but it's not a great idea to push media files through CF due to the TOS prohibiting "disproportionate amounts of non-web content".
2mb might be referencing the limit for Workers KV.
FYI he is bragging, not complaining. There are a dozen ways to reduce or eliminate this problem.
You're an order of magnitude off.
45 TB per day is 1,350 TB in a month, or 1,350,000 GB.
Show me somewhere you can get a petabyte of egress inside a calendar month for 4 figures USD...
Let's suppose you even used the cheaper egress from Cloudfront rather than serving from S3 (lol @ your wallet if you serve 1 PB doing that).
The first petabyte costs an average of $0.045 per GB - $45,000.
The remaining 350 TB costs another $10,500 for a total of $55,500.
Serving off S3 directly? Yeah, that'll be more like $150,000.
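For anyone checking the arithmetic, here it is as a quick script; the $0.045/GB figure is quoted above, and the $0.03/GB rate for the remaining 350 TB is inferred from the $10,500 figure rather than taken from a price sheet:

```python
daily_tb = 45
monthly_gb = daily_tb * 30 * 1000               # 1,350,000 GB per month
first_pb = 1_000_000 * 0.045                    # first petabyte at $0.045/GB -> $45,000
remainder = (monthly_gb - 1_000_000) * 0.03     # remaining 350 TB at ~$0.03/GB -> $10,500
print(monthly_gb, first_pb + remainder)         # 1350000 55500.0
```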
Correct me if I'm wrong, but most colocation/dedicated server providers offer such prices. E.g. hetzner.com @ €1/TB, sprintdatacenter.pl @ €0.91/TB, dedicated.com @ $2/TB (or $600/month for an unmetered 1 Gbps connection).
But if you want S3/CDNs/<insert any cloud offering here>, then yeah, they're expensive.
BTW per Cloudflare ToS:
> Use of the Service for the storage or caching of video (unless purchased separately as a Paid Service) or a disproportionate percentage of pictures, audio files, or other non-HTML content, is prohibited.
Backblaze will send out data for 1 cent per GB, or $13,500. BunnyCDN apparently starts at 1 cent per GB in North America and Europe, and has a fewer-PoP plan that would be $5,550. And let's be fair, you don't need many points of presence for half-gig neural network models.
The other option is buying transit. Let's say we want the ability to deliver 80% of the 45TB in 4 hours as our peak rate. Then we need 20gbps of bandwidth. That's around $9,000 in a big US datacenter. Aim for a peak of 6/8 hours and it's $6,000/$4,500.
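The same back-of-the-envelope in code; the roughly $450 per Gbps per month is inferred from the quoted totals, not something stated outright:

```python
daily_bytes = 45e12                                          # 45 TB/day
for peak_hours in (4, 6, 8):
    # deliver 80% of the day's traffic inside the peak window
    gbps = daily_bytes * 0.8 * 8 / (peak_hours * 3600) / 1e9
    print(peak_hours, round(gbps, 1), round(gbps * 450))     # 4 -> 20.0 Gbps ~$9000; 6 -> 13.3 ~$6000; 8 -> 10.0 ~$4500
```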
My guess is that they don't earn money yet. They are more or less just out of school, so it's a young startup. I actually studied with some of them, and I'm still a student.
By splitting it out you have no guarantee that the big guys will just use the free one.
Maybe a server farm, where each instance downloads a model when spinning up?
You would not believe how many people and how many libraries pull from primary sources directly instead of using local copies of common resources. I found this conversation because I'd just finished fixing that in our code, which took about 5 minutes off the build process.
Let me restate that: We were spending 5 minutes just downloading schema files. In an automated build. Every time, sometimes on several machines at once.
At one point we were trying to convince him that intentionally slowing all requests down by say 500 ms would get the attention of people who were misbehaving. Anyone who was downloading it once and caching would hardly notice the 500 ms. Those running it once per task would be forced to figure their shit out.
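For the schema-file case, one hedged way to keep builds off the network is to resolve well-known schema/DTD URLs against vendored local copies; a minimal sketch with lxml, where the mapping and paths are illustrative:

```python
from lxml import etree

# Map upstream schema URLs to copies checked into the repo or baked into the build image.
LOCAL_SCHEMAS = {
    "http://www.w3.org/2001/xml.xsd": "/opt/schemas/xml.xsd",   # illustrative mapping
}

class LocalSchemaResolver(etree.Resolver):
    def resolve(self, url, public_id, context):
        local = LOCAL_SCHEMAS.get(url)
        if local:                                    # serve the vendored copy: no network round trip
            return self.resolve_filename(local, context)
        return None                                  # anything else falls back to default handling

parser = etree.XMLParser(load_dtd=True)
parser.resolvers.add(LocalSchemaResolver())
# doc = etree.parse("some-document.xml", parser)     # parsing now uses the local copies
```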
It's not an issue I really see, but it seems to me that CI infrastructure (Travis, ...) really should have caching proxies in place.
It's not the worst fate.
I'd say there's only a 10% chance Apple's firewall would let BitTorrent through, and only a 3% chance the CI servers would maintain a positive seed ratio.
Possibly it might solve the problem because users would cache the resources themselves to avoid the hassle of getting BitTorrent into their CI pipeline...
Wasabi's own docs mention a ballpark definition of "reasonable" as monthly transfer being less than total storage. Above that you'll get a call.
I'm sure they'd still be cheaper than AWS, but it's not going to be $5/TB/month to service the entire world.
You can still use a few dedicated servers on Hetzner, though. They'll easily let you serve 200 TB/month from each server for $100. It's obviously not comparable to a proper CDN, but it does work for many companies.
Making objects in S3 publicly available and then complaining about their extortionate bandwidth charges is...silly.
S3 is very expensive, but averaging 4 gigabits 24x7 is not going to be cheap anywhere.
A normal upstream (if you were in a colo or some sort of DC) will charge you about $0.50 per meg. So 4 Gbps would cost you... roughly $2,000/mo.
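A quick sanity check on both figures (the sustained rate implied by 45 TB/day, and the cost at the quoted $0.50 per Mbps):

```python
daily_bytes = 45e12
gbps = daily_bytes * 8 / 86400 / 1e9      # sustained rate over a full day
print(round(gbps, 2))                     # ~4.17 Gbps, i.e. "averaging 4 gigabits 24x7"
print(round(gbps * 1000 * 0.50))          # ~2083 dollars/month at $0.50 per Mbps
```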
I think people don't realize just how much more expensive IaaS providers are. We're literally talking orders of magnitude.
The big cloud providers are very expensive, but you're getting impeccable bandwidth. Low packet loss, burst what you want even in peak times, fast transit worldwide, minimal variability.
What most people actually don't realize is that they're making that tradeoff implicitly. You're not being ripped off if you need all of these things, but if it's overkill you can get much cheaper bandwidth elsewhere.
But it certainly works if you need to CDN a bunch of large files.
Page rules are free if you only need a few, though.
For a commercial company, paying $5-10 a month to save TBs of bandwidth via Backblaze & Cloudflare is a no-brainer. It's just unfortunately not very workable for small projects. GitHub Pages works with Cloudflare for static sites just fine, though.
Or perhaps you have reached out to them, but are for some reason still complaining on Twitter to drum up PR or something?
Regardless, this posting is ridiculously context-free to the point of being click-baity. (But hey, good job, I clicked on it anyway.)
And then people wonder why people don't want to make their stuff free/open source. Even when it's free people still think you're somehow entitled and overcharging.
> Even when it's free people still think you're somehow entitled and overcharging.
I just explicitly advocated for the opposite of that, so I'm not sure where you're getting that.