I'll be that guy; I'm a little confused as to why this would require $200,000 (the total funding requested) to solve. From the site itself:
614,680,691 requests per month come down to ~230 requests per second. Allowing for some spikiness, that boils down to perhaps 1k requests/second at peak. Requests in these cases are mostly relatively simple queries on versioned, highly cacheable data. I say highly cacheable because it is relatively static data for which most (if not all) of the data fields relevant to these requests can fit in the memory of perhaps even a single node (NPM currently includes 48,799 packages, which leaves a very healthy chunk of data per package on 16 GB to 128 GB RAM server boxes).
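For transparency, the back-of-envelope arithmetic; the numbers come from the site, the 4x spikiness factor is just my assumption:

    // Back-of-envelope request rate
    const requestsPerMonth = 614680691;
    const secondsPerMonth = 30 * 24 * 3600;              // ~2.59 million seconds
    const avg = requestsPerMonth / secondsPerMonth;      // ~237 requests/second
    const peak = avg * 4;                                // assumed 4x spikiness -> ~950 req/s
    console.log(Math.round(avg), Math.round(peak));      // 237 949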
The downloads are a bit of a puzzle to me as well. On my machine the average NPM package is about 200 KB (YMMV). 114,626,717 downloads are mentioned on the site. 200 KB times 114 million downloads lands us at roughly 23 TB. Even on a relatively expensive CDN such as Amazon CloudFront, the total monthly cost for that bandwidth and request load, including the required S3 costs, lands at about $3k/month, and that's ignoring all bulk discounts, reserved capacity and so on (which are very significant at these volumes).
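Roughly how I arrived at that figure; the per-GB and per-request rates are my assumptions based on CloudFront list prices, not a quote:

    // Back-of-envelope CDN cost, list rates assumed, no discounts
    const downloadsPerMonth = 114626717;
    const avgPackageBytes = 200 * 1024;                              // ~200 KB average package
    const monthlyTB = downloadsPerMonth * avgPackageBytes / 1e12;    // ~23 TB transferred
    const bandwidthCost = monthlyTB * 1000 * 0.12;                   // assumed ~$0.12 per GB out
    const requestCost = (downloadsPerMonth / 10000) * 0.0075;        // assumed ~$0.0075 per 10k requests
    console.log(monthlyTB.toFixed(1) + ' TB', '$' + Math.round(bandwidthCost + requestCost));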
I'm more than likely oversimplifying a few things here and there (or have failed horribly at math), but I'd still be very interested to hear why this requires such a large investment. Also, wouldn't the more obvious solution be to open source the npmjs software and allow the community to contribute knowledge and time instead?
EDIT: I quickly wanted to point out that I use npmjs.org often, that it is a great service, and that donations are very well deserved. After re-reading my post, it came across more negative than intended.
Every fuzzy-versioned dependency means one request to npm that you cannot (really) cache - at least not without relying on active cache invalidation. And if I look at my average npm install log, that's about half of the requests. For storage estimates you should also keep in mind that it's not only the current packages being stored but all versions of every package. ~A year ago the registry was 25 GB in size, IIRC, and that was a freshly downloaded & compacted copy on my machine. Especially considering that the growth will not suddenly stop, things get complicated.
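To make the caching point concrete: a range like "~1.2.0" can only be resolved against the registry's current version list, so yesterday's cached answer may already be wrong. A rough sketch of the resolution step (simplified, not npm's actual semver code):

    // Resolving "~1.2.0" needs the *current* list of published versions
    function resolveTilde(range, versions) {
      const [maj, min] = range.replace('~', '').split('.').map(Number);
      const matching = versions.filter(v => {
        const [M, m] = v.split('.').map(Number);
        return M === maj && m === min;                 // >=1.2.0 <1.3.0
      });
      return matching.sort().pop();                    // highest matching version
    }

    // If 1.2.4 was published an hour ago, a stale cache would still answer 1.2.3
    console.log(resolveTilde('~1.2.0', ['1.2.0', '1.2.3', '1.2.4', '1.3.0']));  // 1.2.4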
Also: both the website and the registry are already completely open source. But hosting it with "perfect" uptime is a real problem and requires not only network/hardware but also people. Including testing and migrating to a new, more scalable solution - say a handful of people (2-4) work on that for a month, which does not yet include future maintenance. That can easily mean $50,000 just in salaries - or in "people that would normally bring value to paying customers" if you assume that those people will be fine with working that month without pay.
I see what you mean, but allow me to theorycraft a bit more: changes and removals of packages are considerably rarer than additions, and either happens relatively infrequently. This makes cluster-wide cache invalidation relatively trivial (it's easy to implement; it has scalability issues, but scalability won't be an issue here). Also, when I said "cached" I probably should have said keeping various indexes in memory to facilitate queries. I actually work with systems in a roughly similar technical domain (a very different space, though; I work on large-scale TV systems).
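By "indexes in memory" I mean something along these lines; the field names are made up and it's only meant to show the shape of the data:

    // One in-memory map from package name to its metadata document.
    // ~49k packages with a few KB of metadata each fits easily in RAM.
    const index = new Map();

    function publish(doc) {
      index.set(doc.name, doc);          // publishes are rare, so refreshing this is cheap
    }

    function versionsOf(name) {
      const doc = index.get(name);
      return doc ? Object.keys(doc.versions) : [];
    }

    publish({ name: 'express', versions: { '3.4.0': {}, '3.4.1': {} } });
    console.log(versionsOf('express'));  // [ '3.4.0', '3.4.1' ]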
Your other point is definitely the big challenge, but that is exactly the main motivation to use hosted CDNs and other hosted services that have solved this challenge for you to a large extent.
Instead of resolving those requests on the server, why not have the client download an index file with all the available versions of every package, or maybe just an index file with the top 20% of packages responsible for 80% of the load?
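Something like this on the client side, say; the URL and the index format here are purely hypothetical:

    // Hypothetical: one bulk index download instead of one metadata request per package
    const https = require('https');

    function fetchIndex(url) {
      return new Promise((resolve, reject) => {
        https.get(url, res => {
          let body = '';
          res.on('data', chunk => { body += chunk; });
          res.on('end', () => resolve(JSON.parse(body)));
        }).on('error', reject);
      });
    }

    // index.json would map package name -> list of published versions
    fetchIndex('https://registry.example.org/index.json')
      .then(index => console.log(index['express']));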
Anyone can make a mirror. That's the glory of CouchDB. Just kick off replication and BOOM you've got the npm registry. There are community mirrors in Europe (http://npmjs.eu) and Australia.
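For anyone who hasn't tried it, kicking that off is one POST to CouchDB's _replicate endpoint; the registry URL below is the public couch as I remember it, and the local database name is whatever you pick:

    // Pull the public registry into a local CouchDB and keep it in sync
    const http = require('http');

    const body = JSON.stringify({
      source: 'http://isaacs.iriscouch.com/registry/',  // public registry couch (from memory)
      target: 'registry',                               // your local database name
      create_target: true,
      continuous: true                                  // keep pulling new changes as they land
    });

    const req = http.request({
      host: 'localhost', port: 5984, path: '/_replicate', method: 'POST',
      headers: { 'Content-Type': 'application/json', 'Content-Length': Buffer.byteLength(body) }
    }, res => console.log('replication started:', res.statusCode));
    req.end(body);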
If you want to run or use a community mirror that's totally great!
I've started to always check in my node_modules directory. Heroku automatically runs `npm rebuild` in that case, so native modules always work. I've found it to be both faster and more reliable.
npm is pretty much the prime use-case for CouchDB.
A REST API out of the box, replication as a core feature rather than just a scaling feature (multi-master, MVCC), and validation/access control that is pretty much made for it.
The npm registry is implemented as a CouchApp for a reason.
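The validation part, for instance, is just a validate_doc_update function in a design document. Heavily simplified compared to what the registry actually ships, but roughly:

    // Simplified validate_doc_update: only existing maintainers may change a package
    var validate_doc_update = function (newDoc, oldDoc, userCtx) {
      if (oldDoc && oldDoc.maintainers) {
        var allowed = oldDoc.maintainers.some(function (m) {
          return m.name === userCtx.name;
        });
        if (!allowed) {
          throw { forbidden: 'only maintainers may update ' + oldDoc.name };
        }
      }
    };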
I agree that from a functionality perspective this is largely true, at least on paper. That said, if running costs become a significant bottleneck, CouchDB becomes a less obvious choice. Sometimes there's a reluctance to migrate from one technology to another as requirements change over time, but this seems like one of those occasions where exactly that step is required. Sometimes it's good to take a step back, look at your current requirements in terms of cost and performance, and determine what technology best suits your needs. I would question the know-how and objectivity of anyone who would land on CouchDB/node.js in this instance.
This is what I said when the repository went down for over eight hours a few weeks back. Part one is scaling the server; part two is making the client aware of the mirrors and gracefully handling temporary unavailability. From the looks of it, this all seems above the heads of node's current leadership, which frankly sours me on relying on node at all.
Make it into a Docker app, please. I would love to be able to set up a bunch of these commonly used registries as a distributed app I can run on my machine with very little overhead. Have it set up to pull in bulk changes as a cron job, or via torrents maybe?
Having my own personal NPM and own personal RubyGems would be awesome.
Your own personal RubyGems is downright trivial, given a constrained set of gems: upload the .gem files to a directory, build an index, serve static files. I'm guessing there's something about npm packages which makes that impossible.
I'm a big fan of npm but there are unanswered questions here.
1. Why $200,000? Can we get a rough budget so we can understand how it will be used and how long it will last?
2. We should all be thankful for the time and resources Nodejitsu/Joyent/IrisCouch put into node and npm. That said, wouldn't the projects be better off separated from these businesses, with their own funding? If we were donating money to the projects instead of a for-profit corp, we would have more certainty about how and when the money will be used. "Donating" to Nodejitsu just adds to their bottom line and in reality could be used however they want. If something happens to the business, we have no guarantee the money would continue to be used for npm.
Well, benchmarks show it's not more webscale than a plain old JEE app, or a Go app, or even JavaScript on the JVM (http://www.techempower.com/benchmarks/). By the way, Java 8 comes with a new JS engine, I believe; it might be interesting to see if node gets ported to the JVM with it.
Yes, it will be able to serve more concurrent requests than your typical Python/Ruby/PHP app.
But npm doesn't even seem to run on node; it looks like it is a CouchApp. I don't know how it performs.
I'd love to donate, but like most Germans I don't own a credit card. Why do so many people ignore that credit cards are not the default payment method in some countries? I'd even accept paying the extra fees for using PayPal.
I don't think giving money for more servers and hosting is really the answer here. I think decentralizing and distributing the registry is the way forward. There is one project I know of that is trying to make this happen: https://github.com/jmgunn87/mynpm
I always feel guilty about how much I end up downloading from the npm registry. I keep my nodejs projects in separate dirs, so I end up downloading the same dependencies over and over again each time I start a new project.
I wish the --global install switch were cleverer and allowed you to have multiple versions of the same package installed at the same time. Then I could just symlink everything together, which would save them bandwidth (and save me disk space).
npm actually caches packages locally, so don't feel too bad. Still, if you have wildcard dependencies it will hit the main npm server to check whether you have the most recent version in your cache.
* Offer a paid, private registry that doesn't cost an insane amount of money. Somehow host it on the same metal as the public repo.
* Decentralize. Make it easier to set up mirrors or proxy/cache layers. If I had a simple-to-deploy npm caching proxy that didn't need to replicate every upstream package, only the ones that I use, it would reduce load upstream and protect me when upstream fails (see the sketch below). ++ if I can host private packages there as well.
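A lazy pull-through cache really doesn't need to be much more than this; a hypothetical sketch with no error handling, and in practice metadata documents would need a TTL while tarballs can be cached forever:

    // Minimal lazy pull-through cache: fetch from upstream on a miss, serve from disk afterwards
    const http = require('http');
    const https = require('https');
    const fs = require('fs');
    const path = require('path');

    const CACHE_DIR = '/var/cache/npm-proxy';        // assumption: an existing, writable directory
    const UPSTREAM = 'https://registry.npmjs.org';

    http.createServer((req, res) => {
      const file = path.join(CACHE_DIR, encodeURIComponent(req.url));
      if (fs.existsSync(file)) {
        fs.createReadStream(file).pipe(res);         // cache hit: never touches upstream again
        return;
      }
      https.get(UPSTREAM + req.url, upstream => {    // cache miss: pull from upstream once
        upstream.pipe(fs.createWriteStream(file));
        upstream.pipe(res);
      });
    }).listen(8080);                                 // point npm at it via --registry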
Certainly less than reinventing the wheel. You have to include development cost, maintenance and so forth, which are not needed, or are significantly reduced, with hosted content delivery services. See my post in this thread for details.
They can at the scale of the NPM registry, but I don't imagine some of the lower-end CDNs would be that much more expensive than scaling up the backend to meet demand.
"built to do multi master replication" is a pretty naive way of putting it. Couch does not (by default) support a multi-master cluster setup. The way in which couchdb supports mutli-master is "you have different servers that sync data and conflicts are resolved on application level". And I wrote "data" for a reason because there's stuff you need to sync yourself if you want to have multiple couchdb servers appear as one to the outside.
True - multi-master in CouchDB means that you can have two CouchDB instances (or, indeed, anything which speaks the CouchDB replication protocol - see http://www.replication.io/) which can be synced and, in the event of a network partition, both instances can be written to. One of those masters could be CouchDB and one could be PouchDB in a browser or TouchDB on a mobile phone. Neither instance requires any coordination from the other.
Clustered CouchDB (BigCouch - on its way to being integrated into CouchDB vNext) relies on the same MVCC semantics - it's just using Erlang rather than HTTP to transfer documents between nodes and attempts to keep the nodes in sync continuously.
Conflict resolution is tricky but CouchDB plays it safe and keeps all conflicting revisions of each document around until you resolve them to ensure no data is lost. It's pretty easy to find and fetch any conflicted documents so they can be resolved in an application-specific way.
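Finding them is a GET with ?conflicts=true (or a view that emits _conflicts); the database and document names below are just placeholders:

    // Fetch a document together with any conflicting revisions it carries
    const http = require('http');

    http.get('http://localhost:5984/registry/some-package?conflicts=true', res => {
      let body = '';
      res.on('data', chunk => { body += chunk; });
      res.on('end', () => {
        const doc = JSON.parse(body);
        if (doc._conflicts) {
          // application-specific: pick or merge a winner, then delete the losing revisions
          console.log('conflicting revisions:', doc._conflicts);
        }
      });
    });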
Regarding the banners: you say that the banner will be on the scalenpm.org site; is there any way to get on the npmjs.org site, or somewhere else with greater visibility?
They mention somewhere on the site that they sell a private version of npm that they are working on, and Nodejitsu (the company behind npm) also runs a node hosting platform: https://www.nodejitsu.com/