Help Scale NPM (scalenpm.org)
125 points by jenius on Nov 26, 2013 | 60 comments


I'll be that guy; I'm a little confused as to why this would require $200,000 (the total funding requested) to solve. From the site itself:

614,680,691 requests per month comes down to ~230 requests per second. Allowing for some spikiness, that boils down to perhaps 1k requests/second at peak. Requests in these cases are mostly relatively simple queries on versioned, highly cacheable data. I say highly cacheable because it is relatively static data for which most (if not all) of the fields relevant to these requests can fit in the memory of perhaps even a single node (npm currently includes 48,799 packages; that leaves a very healthy chunk of data per package on 16GB-128GB RAM server boxes).

The downloads are a bit of a puzzle to me as well. On my machine the average npm package is about 200KB (YMMV). 114,626,717 downloads are mentioned on the site. 200KB times 114 million downloads lands us at roughly 23 TB. Even on a relatively expensive CDN such as Amazon CloudFront, the total monthly cost for that bandwidth and request load, plus the required S3 costs, comes to about $3k/month, and that's ignoring all bulk discounts, reserved capacity and so on (which are very significant at these volumes).
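
To make the back-of-the-envelope numbers concrete, here is the arithmetic as a quick script (figures from the site; the 200KB average is just what I see locally):

  // Rough arithmetic behind the estimates above.
  var requestsPerMonth = 614680691;
  var secondsPerMonth = 30 * 24 * 3600; // ~2.59 million seconds
  console.log((requestsPerMonth / secondsPerMonth).toFixed(0)); // ~237 req/s average

  var downloadsPerMonth = 114626717;
  var avgPackageBytes = 200 * 1024; // assumed ~200KB average package
  console.log((downloadsPerMonth * avgPackageBytes / 1e12).toFixed(1) + ' TB'); // ~23.5 TB/month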

I'm more than likely oversimplifying a few things here and there (or have failed horribly at math), but I'd still be very interested to hear why this requires such a large investment. Also, wouldn't the more obvious solution be to open source the npmjs software and let the community contribute knowledge and time instead?

EDIT: I quickly wanted to point out that I use npmjs.org often, that it is a great service, and that the donations are very well deserved. After re-reading my post, it came across more negative than intended.


Every fuzzy-versioned dependency means one request to npm that you cannot (really) cache - at least not without doing, and relying on, active cache invalidation. And if I look at my average npm install log, that's about half of the requests. For storage estimates you should also remember that the registry stores not only the latest package but every version of it. ~A year ago the registry was 25GB in size, IIRC - and that was a freshly downloaded & compacted copy on my machine. Especially considering that the growth will not suddenly stop, things get complicated.
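
To illustrate the fuzzy-version problem, a sketch using the semver module (the same range logic npm uses; the version list here is made up):

  var semver = require('semver'); // npm install semver

  // A range like "~1.2.0" has no fixed answer: the matching version changes
  // whenever a new 1.2.x is published, so the client has to re-ask the
  // registry instead of trusting a cached response.
  var published = ['1.2.0', '1.2.1', '1.2.5', '1.3.0']; // hypothetical versions
  console.log(semver.maxSatisfying(published, '~1.2.0')); // "1.2.5" -- until 1.2.6 ships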

Also: both the website and the registry are already completely open source. But hosting it with "perfect" uptime is a real problem and requires not only network/hardware but also people - including testing and migrating to a new, more scalable solution. Say a handful of people (2-4) work on that for a month, which does not yet include future maintenance. That can easily mean $50,000 just in salaries - or in "people that would normally bring value to paying customers" if you assume those people will be fine working that month without pay.


I see what you mean, but allow me to theorycraft a bit more: changes to and removals of packages are considerably rarer than additions, and both happen relatively infrequently. That makes cluster-wide cache invalidation relatively trivial (it's easy but has scalability limits, and at this size scalability won't be an issue). Also, when I said "cached" I probably should have said keeping various indexes in memory to facilitate queries. I actually work with systems in a roughly similar technical domain (way different space though; I work on large-scale TV systems).

Your other point is definitely the big challenge, but that is exactly the main motivation to use hosted CDNs and other hosted services that have solved this challenge for you to a large extent.

I'm almost tempted to have a go at this.


Instead of resolving those requests on the server, why not have the client download an index file with all the available versions of every package, or maybe just an index file with the top 20% of packages responsible for 80% of the load?
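
Something like this, say - a minimal sketch where ranges are resolved entirely client-side against a periodically fetched index (the index format and filename are hypothetical):

  var semver = require('semver');

  // Hypothetical index, fetched and cached by the client, e.g.
  // {"express": ["3.4.0", "3.4.4"], "underscore": ["1.5.1", "1.5.2"]}
  var index = require('./registry-index.json');

  function resolve(name, range) {
    // resolved locally -- no per-dependency request to the registry
    return semver.maxSatisfying(index[name] || [], range);
  }

  console.log(resolve('express', '~3.4.0')); // "3.4.4"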


In their description, it sounded like they might hire against it. That's not much compared to a potential salary over a few years.


Wouldn't it be better to make NPM more distributed so that anyone could set up a mirror and help out?

EDIT: Not saying it would be easy; I'm just wondering if you've considered this direction.


Anyone can make a mirror. That's the glory of CouchDB. Just kick off replication and BOOM you've got the npm registry. There are community mirrors in Europe (http://npmjs.eu) and Australia.

If you want to run or use a community mirror that's totally great!

  npm config set registry http://my.awesome.community.mirror
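
And kicking off your own replica is one POST to CouchDB's _replicate endpoint. A sketch against a local Couch (the localhost target and the public source URL are assumptions):

  // Pull the public registry into a local CouchDB and keep it synced.
  var http = require('http');

  var body = JSON.stringify({
    source: 'http://isaacs.iriscouch.com/registry', // public registry couch
    target: 'registry',                             // local database name
    create_target: true,
    continuous: true // keep replicating as new packages are published
  });

  var req = http.request({
    host: 'localhost', port: 5984, path: '/_replicate', method: 'POST',
    headers: { 'Content-Type': 'application/json' }
  }, function (res) {
    console.log('replication started:', res.statusCode); // 202 = accepted
  });
  req.end(body);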


This is great and all, but when Heroku relies on the main npm registry being up (and it's not), you're screwed.


Heroku customers can always clone the nodejs buildpack and simply point the npm registry at another mirror.


I've started to always check in my node_modules directory. Heroku automatically runs `npm rebuild` in that case, so native modules always work. I've found it to be both faster and more reliable.


I'm surprised they're using CouchDB and not MongoDB.


npm is pretty much the prime use-case for CouchDB. REST API out of the box, replication is a core feature and not just a scaling feature (multi-master, MVCC) and the validation/access control is pretty much made for it. The npm registry is implemented as a CouchApp for a reason.
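
For anyone unfamiliar: that validation hook is a plain JS function stored in a design document, which CouchDB runs on every write. A sketch of the flavor of rule involved (illustrative, not the registry's actual code):

  // Lives in a design document; CouchDB invokes it for each attempted write.
  function validate_doc_update(newDoc, oldDoc, userCtx) {
    // e.g. only listed maintainers may publish over an existing package
    if (oldDoc && oldDoc.maintainers) {
      var names = oldDoc.maintainers.map(function (m) { return m.name; });
      if (names.indexOf(userCtx.name) === -1) {
        // rejecting a write is as simple as throwing {forbidden: ...}
        throw({ forbidden: userCtx.name + ' may not modify ' + oldDoc._id });
      }
    }
  }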


I agree that from a functionality perspective this is largely true, at least on paper. That said, if running costs become a significant bottleneck, CouchDB becomes a less obvious choice. There's sometimes a reluctance to migrate from one technology to another as requirements change over time, but this seems like one of those occasions where exactly that step is required. Sometimes it's good to take a step back, look at your current requirements in terms of cost and performance, and determine what technology best suits your needs. I would question the know-how and objectivity of anyone who would land on CouchDB/node.js in this instance.


There's no node involved in the registry itself. This is purely CouchDB.


CouchDB has better replication support (e.g. multi-master, MVCC).


A lot of the central node community started on Couch.


Why?


It's hard but the payoff is worth it. Just look at CPAN or any Linux package distribution system. The package hosting is all mirrored.



This was what I said when the repository went down for over eight hours a few weeks back. Part one is scaling the server, part two is making the client aware of the mirrors and gracefully handling temporary unavailability. From the looks of it, this all seems above the heads of node's current leadership which frankly sours me on relying on node at all.


Make it into a Docker app, please. I would love to be able to set up a bunch of these commonly used registries as a distributed app I can run on my machine with very little overhead. Have it set up to pull in bulk changes as a cron job, or maybe via torrents?

Having my own personal NPM and own personal RubyGems would be awesome.


Your own personal rubygems is downright trivial, given a constrained set of gems. Upload .gems to a directory, build an index, serve static files. I'm guessing there's something about npm packages which makes that impossible.


Yes, npm itself is the only reason why.


https://github.com/jmgunn87/mynpm <- docker app is jmgunn87/mynpm


I'd also be interested to hear a response to this from someone at Nodejitsu.


I'm a big fan of npm but there are unanswered questions here.

1. Why $200,000? Can we get a rough budget so we can understand how it will be used and how long it will last?

2. We should all be thankful for the time and resources Nodejitsu/Joyent/IrisCouch put into node and npm. That said, wouldn't the projects be better off separated from these businesses, with their own funding? If we were donating money to the projects instead of a for-profit corp, we would have more certainty about how and when the money will be used. "Donating" to Nodejitsu just adds to their bottom line and in reality could be used however they want. If something happens to the business, we have no guarantees the money would continue to be used for npm.


This is a bit confusing. Am I right in asserting the following?...

Commercial PaaS hosting firm Nodejitsu is asking for donations to pay (or help pay) for the costs of running npm.

Nodejitsu plan on using said funds to purchase additional resources at Joyent, where npm is currently hosted.

Joyent owns the trademark for Node.js.


But I thought node was web scale.

Edit: Yep, that was a lame joke. Anyhow, take my money; I love NPM and use it daily.


Well, benchmarks show it's not more webscale than a plain old JEE app, or a Go app, or even JavaScript on the JVM (http://www.techempower.com/benchmarks/). By the way, Java 8 comes with a new JS engine, I believe; it might be interesting to see whether node gets ported to the JVM with it.

Yes, it will be able to serve more concurrent requests than your typical Python/Ruby/PHP app.

But npm doesn't even seem to run on node; it looks like it is a couch app. I don't know how it performs.


I'd love to donate. But like most Germans, I don't own a credit card. Why do so many people ignore the fact that credit cards are not the default payment method in some countries? I'd even accept paying the extra fees for using PayPal.


Because there are very few ways of receiving payments from multiple source types reliably. PayPal is not reliable.


This. It's also notorious for freezing assets from crowdfunding.


I don't think giving up money for more servers and hosting is really the answer here. I think decentralizing and distributing the registry is the way forward. There is one project I know of that is trying to make this happen: https://github.com/jmgunn87/mynpm


Appears to be running slowly. Maybe we need a scalescalenpmdotorg.org?


It's down now :)


From the JS comments - this made me laugh:

  /**
   * Simple counter magic to make people engaged.
   *
   * @constructor
   */


Glad you liked it! The counting is derived from the last week of npm downloads uniformly distributed over time. http://npmjs.org


I always feel guilty about how much I end up downloading from the npm registry. I keep my node.js projects in separate dirs, so I end up downloading the same dependencies over and over again each time I start a new project.

I wish the --global install switch were cleverer and allowed you to have multiple versions of the same package installed at the same time. Then I could just symlink everything together, which would save them bandwidth (and save me disk space).


npm actually caches packages locally, so don't feel too bad. Still, if you have wildcard dependencies it will hit the main npm server to check whether you have the most recent version in your cache.


Better ideas than this:

* Offer a paid, private registry that doesn't cost an insane amount of money. Somehow host it on the same metal as the public repo.

* Decentralize. Make it easier to set up mirrors or proxy/cache layers. If I had a simple-to-deploy npm caching proxy that didn't need to replicate every upstream package, only the ones that I use, it would reduce load upstream and protect me when upstream fails (see the sketch below). ++ if I can host private packages there as well.
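
A minimal sketch of such a lazy caching proxy, assuming the public upstream at registry.npmjs.org and a local cache directory (a real one would also need metadata handling and cache eviction):

  // Serve packages from the local cache when present; otherwise fetch them
  // from the upstream registry, keep a copy, and stream them through.
  var http = require('http'),
      https = require('https'),
      fs = require('fs'),
      path = require('path');

  var UPSTREAM = 'https://registry.npmjs.org'; // assumed upstream
  var CACHE = '/var/cache/npm-proxy';          // assumed cache directory

  http.createServer(function (req, res) {
    var file = path.join(CACHE, encodeURIComponent(req.url));
    if (fs.existsSync(file)) return fs.createReadStream(file).pipe(res);

    https.get(UPSTREAM + req.url, function (upstream) {
      upstream.pipe(fs.createWriteStream(file)); // cache for next time
      upstream.pipe(res);                        // and serve it through
    }).on('error', function () { res.statusCode = 502; res.end(); });
  }).listen(8080);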


They should take Bitcoin for additional exposure.


To be honest, the "Card number" and "Security code" on the form struck me as a little weird.


I wonder why they can't/don't make use of a CDN to scale downloads. Unless they do already and I'm not aware.


CDNs cost a lot of money.


Certainly less than reinventing the wheel. You have to include development costs, maintenance and so forth, which are not needed, or are significantly reduced, on hosted content delivery services. See my post in this thread for details.


They can cost a lot at the scale of the npm registry, but I don't imagine some of the lower-end CDNs would be that much more expensive than scaling up the backend to meet demand.


Countless hours have been saved by NPM. I would have donated a bit more if it let me input the exact amount.


Considering that CouchDB was built to do multi-master replication, it's just a matter of adding more servers and setting up automatic replication.

Also, is the current setup using any kind of front end caching like Varnish?


"built to do multi master replication" is a pretty naive way of putting it. Couch does not (by default) support a multi-master cluster setup. The way in which couchdb supports mutli-master is "you have different servers that sync data and conflicts are resolved on application level". And I wrote "data" for a reason because there's stuff you need to sync yourself if you want to have multiple couchdb servers appear as one to the outside.


True - multi-master in CouchDB means that you can have two CouchDB instances (or, indeed, anything which speaks the CouchDB replication protocol - see http://www.replication.io/) which can be synced and, in the event of a network partition, both instances can be written to. One of those masters could be CouchDB and one could be PouchDB in a browser or TouchDB on a mobile phone. Neither instance requires any coordination from the other.

Clustered CouchDB (BigCouch - on its way to being integrated into CouchDB vNext) relies on the same MVCC semantics - it just uses Erlang rather than HTTP to transfer documents between clusters and attempts to keep the nodes in sync continuously.

Conflict resolution is tricky but CouchDB plays it safe and keeps all conflicting revisions of each document around until you resolve them to ensure no data is lost. It's pretty easy to find and fetch any conflicted documents so they can be resolved in an application-specific way.
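
For the curious, spotting those conflicts is straightforward - a sketch against a local replica (URL and document name are assumptions):

  // Fetch a document together with its conflict info. Revisions that lost a
  // replication race show up in _conflicts; the app picks or merges a winner
  // and deletes the losing revisions.
  var http = require('http');

  http.get('http://localhost:5984/registry/some-package?conflicts=true', function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      var doc = JSON.parse(body);
      console.log('winning rev:', doc._rev);
      console.log('conflicting revs:', doc._conflicts || []);
    });
  });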


Regarding the banners: you say the banner will be on the scalenpm.org site; any way to get it on the npmjs.org site or somewhere else with greater visibility?


Could npm use torrents somehow to reduce the load on the main servers? This would require users to opt in and become peers... just a random thought.


Out of curiosity, how do they normally pay the bills?


They mention somewhere on the site that they sell a private version of npm that they are working on, and Nodejitsu (the company behind npm) also runs a node hosting platform: https://www.nodejitsu.com/


I'd imagine they've been footing the bill, and that it's become unsustainable.


I would only donate if they dump Nodejitsu.


So... Will I receive an email or something to confirm that my donation went through?


I want to give $50, can I just get a t-shirt?


So now we live in a world where people extort you to fix their broken service. The first repository is always free...


This website is not scaling well.



