
I thought I'd answer some of your questions, as the person who pays the bill.

1. This can be cheaper on AWS. We've been meaning to move to reserved instances, paying a year at a time, for a while, and simply haven't done it yet.

2. Fastly has already donated CDN usage to us, but we haven't fully utilized it yet as we're (slowly) sorting out some issues between primary gem serving and the bundler APIs.

3. RubyCentral pays the bill and can afford to do so via the proceeds generated from RubyConf and RailsConf.

4. The administration is an all-volunteer effort (myself included). Because of that, paying a premium to use AWS has its advantages: it allows more volunteers to help out, given how well traveled the platform is. In the past, RubyGems was hosted on dedicated hardware within Rackspace. While this was certainly cheaper, it created administrative issues. Granted, those can be solved without using AWS, but again, we want as little administrative friction as possible.

Any other questions?

> In the past, RubyGems was hosted on dedicated hardware within Rackspace. While this was certainly cheaper, it created administrative issues. Granted, those can be solved without using AWS, but again, we want as little administrative friction as possible.

If Rackspace can be of assistance in the future, feel free to reach out (brian.curtin@rackspace.com). We currently donate hosting to many open source projects, including ones in a similar space, like the Python Package Index.

Thanks! I'll bring it up with the team.

Note that if you can get Rackspace or whomever to donate the hardware/bandwidth, you could spend less than $7k/month to hire a very competent admin to solve the administrative issues, which would probably lead to better service for everybody.

On that note, you might check out the Open Source Lab at Oregon State University. They host many projects: http://osuosl.org/communities

Hey Evan, as with RubyForge for the last 7-odd years, you'd be welcome to a free account on Bytemark's UK cloud platform bigv.io, or dedicated servers, or a mix on the same VLAN. We're a Ruby shop ourselves, and we host a fair chunk of Debian in our data centre too these days (https://www.debian.org/News/2013/20130404). So just drop me a line if that's of interest <matthew@bytemark.co.uk>.

I assume this was posted because it's an enormous bill :) but obviously if you're happy with it, carry on!

Did you consider using a mirror network, with servers run by external organizations, instead of paying for AWS bandwidth for RubyGems? That seems like a good approach for the static/bulk part of your dataset, and there are lots of companies and universities set up to serve software. (The mirror I manage serves about 50 TB/month for several Linux distros, and many sites are larger.) Do the work and infrastructure required to manage these networks make them not worthwhile?

Edit: Found a post [0] calling for a rubygems mirror network. Otherwise there is lots of information about setting up local mirrors of the repository.

[0] http://binarymentalist.com/post/1314642927/proposal-we-have-...

It's been discussed many times before, yes. Our users' RubyGems usage patterns make any kind of mirror delay unacceptable. We currently run a number of mirrors, configured as caching proxies. I want to get us going on a CDN like Fastly soon because they provide effectively the same functionality, but distributed to many, many more POPs than I will ever set up.

I suspect mirror delay is less of an issue than you might perceive it to be. Many CPAN mirrors manage to stay within tens of seconds, and no more than a minute, of the main CPAN mirror that PAUSE publishes to.

If it's just the sync delay, you could track each mirror's last-updated time and only direct users to a mirror that had synchronized with the master since the package in question was released. Otherwise, serve the content from AWS. Though I'm sure this couldn't beat the service that Fastly's donating.
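A minimal sketch of that selection rule, assuming made-up mirror metadata (the `last_sync` and `latency_ms` fields, URLs, and times are all hypothetical; a real deployment would read sync times from a heartbeat file and release times from the gem index):

```ruby
require "time"

# A mirror is usable for a gem only if it has synchronized with the
# master since that gem version was released; otherwise fall back to
# the origin. Among fresh mirrors, prefer the lowest latency.
def pick_source(mirrors, released_at, origin: "https://rubygems.org")
  fresh = mirrors.select { |m| m[:last_sync] > released_at }
  fresh.empty? ? origin : fresh.min_by { |m| m[:latency_ms] }[:url]
end

mirrors = [
  { url: "https://mirror-a.example", last_sync: Time.parse("2014-03-05 12:00 UTC"), latency_ms: 80 },
  { url: "https://mirror-b.example", last_sync: Time.parse("2014-03-05 06:00 UTC"), latency_ms: 20 },
]

# A gem released at 09:00 has only reached mirror-a; a day-old gem can
# use the lowest-latency mirror instead.
puts pick_source(mirrors, Time.parse("2014-03-05 09:00 UTC"))
puts pick_source(mirrors, Time.parse("2014-03-04 09:00 UTC"))
```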

The caching mirror configuration achieves nearly the same thing. In the past, people have wanted to run their own mirrors that we directed people to, but that's got reliability and security issues.

Mirrors shouldn't be a security concern; the signatures of packages should come from "headquarters". The same goes for reliability: clients should be able to, and SHOULD, pull from multiple sites simultaneously.
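The signature argument can be sketched as follows: the client trusts only a checksum list signed by the publisher, and verifies the bytes any mirror hands back against it. The key, file name, and checksum-file format here are all invented for illustration:

```ruby
require "openssl"
require "digest"

# "Headquarters" signs a list of gem checksums with its private key.
key = OpenSSL::PKey::RSA.new(2048)
gem_bytes = "gem bytes"
checksums = "rack-1.5.2.gem #{Digest::SHA256.hexdigest(gem_bytes)}\n"
signature = key.sign(OpenSSL::Digest::SHA256.new, checksums)

# Client side: first verify the signed index with the public key...
ok_index = key.public_key.verify(OpenSSL::Digest::SHA256.new, signature, checksums)
# ...then check that whatever the (untrusted) mirror served matches it.
expected = checksums[/rack-1\.5\.2\.gem (\h+)/, 1]
ok_bytes = Digest::SHA256.hexdigest(gem_bytes) == expected
puts ok_index && ok_bytes   # => true
```

A tampered index or a tampered gem fails one of the two checks, so the mirror never needs to be trusted.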

Even if package signing works perfectly, when I connect to a mirror and request a patch for foo, the mirror learns my IP address and the fact I have an as-yet-unpatched version of foo.

Very true on the signatures. Using multiple sites isn't necessary though, imho.

I could be wrong, but it seems like a nice hack to pull from, say, 3 mirrors at the same time at some offset into the resource, using a range GET for, say, 16k each. The first one to complete does a pipelined request for another 16k slot, and this process continues until the entire asset is downloaded. The fast mirrors would dominate, a small percentage of the bandwidth from slower mirrors would assist, and truly slow mirrors would be ignored.
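A rough sketch of that scheme, with the HTTP Range request replaced by a pluggable `fetch` lambda (the mirror names and in-memory blob are stand-ins): workers share one queue of 16 KiB ranges, so whichever mirror finishes a chunk first simply takes the next slot, and fast mirrors naturally dominate.

```ruby
CHUNK = 16 * 1024

# `fetch` stands in for an HTTP request with a "Range: bytes=..." header.
def multi_mirror_download(size, mirrors, fetch:)
  ranges = Queue.new
  (0...size).step(CHUNK) { |off| ranges << (off...[off + CHUNK, size].min) }
  mirrors.size.times { ranges << nil }   # one stop marker per worker

  parts = {}
  lock  = Mutex.new
  mirrors.map { |m|
    Thread.new do
      while (r = ranges.pop)             # fast workers loop more often
        data = fetch.call(m, r)
        lock.synchronize { parts[r.first] = data }
      end
    end
  }.each(&:join)
  parts.sort.map(&:last).join            # reassemble chunks in offset order
end

blob = "x" * 100_000
got  = multi_mirror_download(blob.size, %w[mirror-a mirror-b mirror-c],
                             fetch: ->(_mirror, range) { blob[range] })
puts got == blob   # => true
```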

It would be really interesting to see the bandwidth broken down by gem. I suspect rails would be at the top, but it'd be good to confirm.

If most of the installs are on servers, have you considered talking to server providers about setting up internal mirrors on their networks? That might save everyone a lot of bandwidth.

Of course, people shouldn't really be installing their gems from rubygems.org on servers anyway. Is there any way to prod bundler to default to packaging gems and doing a local install where possible, rather than downloading them every time there is a deploy (the current default)? At present you use double the bandwidth: people download once on their local machine and once on their server.

Fetching the RubyGems index with bundler/rubygems still takes a while every time I bundle update. Have you looked at optimising that part of the process further? (At least it doesn't fetch a list of all gems now, but it still fetches a list of all versions of each gem, doesn't it?) Caching older gem results would help: the list of versions for an old gem should never change, so you should really only need to fetch a very small list of latest versions. The memory and bandwidth usage there is still quite high.

Hey, a chance to plug my thing!

I built S3stat (https://www.s3stat.com/) to fix the opaqueness that comes with using CloudFront as a CDN, and to get you at least back to the level of analytics you'd have if you were hosting files from one of your own servers.

RubyGems guys, if you have logging set up already, I'd be happy to run reports for all your old logs (gratis, naturally) so you can get a better idea of which files (and as another commenter wondered about, which sources) are costing you the most.

Off topic: S3stat is our go-to service. We've been using it for years and really couldn't survive without it, as it's how we charge our clients!

I still don't understand how AWS was preferable to a dedicated server host. Could you elaborate on that?

Virtualization allows us to spin up new instances and migrate traffic to them. This means we can work entirely from Chef and keep things clean. This is important so that our volunteers have a complete picture of an instance and are able to build new ones.

You can easily do that on dedicated hardware too. We run all our stuff in VMs and containers, even on the office dev servers 3 meters behind my desk.

And pretty much "all" dedicated server providers these days also have cloud offerings if you need to spin up some instances quickly to handle traffic spikes etc., or for dev/testing purposes.

Do you know who the biggest consumers of bandwidth are? I would guess the CI servers (Travis, Circle).

I think that bandwidth consumed by Circle should be free, since we're also hosted in AWS. Maybe somebody who knows more about the details of Amazon's billing can confirm/deny.

Bandwidth is free within the same region, but not across regions.

*edit: and I believe it's not free if you end up using the public IP address instead of the internal IP address.

If you use the EC2 public DNS name, it will resolve to an internal IP when the request comes from within EC2.

A very good question. I'll see about crunching some of the logs to break it down by subnet.
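That kind of breakdown is straightforward to sketch: bucket client IPs by their 24-bit prefix and keep the busy networks. The threshold and sample IPs below are made up; real input would be parsed out of the access-log lines.

```ruby
require "ipaddr"

# Count hits per /24 and return the networks at or above a threshold,
# busiest first. IPAddr masks "a.b.c.d/24" down to the network address.
def hits_by_slash24(ips, threshold: 2)
  counts = Hash.new(0)
  ips.each { |ip| counts["#{IPAddr.new("#{ip}/24")}/24"] += 1 }
  counts.select { |_net, n| n >= threshold }
        .sort_by { |_net, n| -n }
end

ips = %w[54.10.2.9 54.10.2.77 54.10.2.100 192.0.2.1]
p hits_by_slash24(ips)   # => [["54.10.2.0/24", 3]]
```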

Great. Whoever the major commercial users are, they have a financial incentive to keep the service performant. They should all at least be sponsors at some level if they aren't already.

Here is a partial log, every /24 that had more than 10k hits in the last 24 hours: https://gist.github.com/evanphx/9361755

Top 5 are hosting providers. Makes sense.

Isn't Blue Box where Travis is hosted?

Hey, as I mentioned in another part of this thread, my startup crunches those logs for a living (and they're sadly not really designed for crunching by anything off the shelf). Ping me if you'd like a hand with it.

How have the costs changed in the last year or so? It would be cool to see a month-over-month graph.

I'll put that on my todo list.

No question. Though I'll use the occasion to thank you for all the dedication, financial commitment and awesome software you've provided us with in the Ruby community.

Need help with the VCL with Fastly? Drop me a line to my name minus ct @npmjs.com.

Thanks! I'll definitely keep you in mind as we're (finally) getting around to setting up correctly.

Same here: we host Maven Central with fastly and are willing to help out any way we can. @sonatype.com

Have you looked into the Rackspace Cloud offering?

