Is there any benefit to using CoreOS when you don't need a million machines? How much work is it to start with, for example, if you have no idea what your scaling needs will be in the future?
No work at all; basic CoreOS is just Docker managed by systemd. Can you make your application Docker-izable? Were you planning on using process management? If you answered yes to both of these questions... :)
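For a sense of scale, a minimal systemd unit that runs a container is only a few lines (names, ports, and image here are made up for illustration):

    [Unit]
    Description=My app container
    After=docker.service
    Requires=docker.service

    [Service]
    # remove any stale container before starting; leading "-" ignores failure
    ExecStartPre=-/usr/bin/docker rm -f myapp
    ExecStart=/usr/bin/docker run --name myapp -p 8080:8080 example/myapp
    ExecStop=/usr/bin/docker stop myapp

    [Install]
    WantedBy=multi-user.target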
Further, CoreOS tries to make you write applications in a 12-factor-y way, so when the time does come for you to use a million machines, you won't need to make huge adjustments to your deployments (just plug the container and init script into fleet and let it roll).
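When that day comes, handing the same unit to fleet is roughly this (unit names hypothetical):

    # the unit file gains one extra section so fleet spreads instances out:
    #   [X-Fleet]
    #   Conflicts=myapp@*.service
    fleetctl submit myapp@.service
    fleetctl start myapp@1.service myapp@2.service myapp@3.service
    fleetctl list-units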
The main benefit is that it's an easy-to-update, minimal OS. And Docker's already set up for you and updated regularly.
Our cloud-config implementation is designed so that you can use the same config to set up a new machine to match an existing one, and have it automatically join the cluster. Start with 3 machines and scale up as you need.
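Something along these lines (the discovery token is a placeholder; you'd generate your own at https://discovery.etcd.io/new):

    #cloud-config
    coreos:
      etcd:
        # every machine booted with the same token joins the same cluster
        discovery: https://discovery.etcd.io/<token>
        addr: $private_ipv4:4001
        peer-addr: $private_ipv4:7001
      units:
        - name: etcd.service
          command: start
        - name: fleet.service
          command: start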
Learning systemd is one bump you'll have to get over, but all of the major distros will be using it, so there's no better time to try it out.
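The day-to-day surface area is small; for a unit (name made up) it's mostly:

    systemctl start myapp.service      # start a unit
    systemctl status myapp.service     # is it running? recent log lines
    journalctl -u myapp.service -f     # follow the unit's logs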
I'm interested in it because I plan to use Docker to ease application management, and with CoreOS I barely have to maintain the operating system because it's so lightweight. So it greatly simplifies things :)
If you think docker is a good fit for your application, this is kind of the next step.
The only significant benefit over traditional operating systems/network services is that it's designed to work on flaky hardware and networks. The only use case I've found for decentralized, distributed networks of application services (other than obscure stuff like parallel processing of large datasets) is when you have no guarantees of availability. As far as "scaling" goes, you don't need CoreOS to build a scalable network (and I've seen no performance benchmarks of CoreOS running on thousands of machines at a time, so I have no idea how well it scales)
Run from root partition A and apply updates only to partition B, essentially via chroot. The downtime required for updates is the cost of a reboot. If an error occurs during boot, reboot into the other partition, which holds your last known-good config.
Seems simple and straightforward, but it's difficult to retrofit onto unmodified Debian-based distributions, so they've addressed this as a key feature of CoreOS.
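Conceptually it's the ChromeOS-style GPT priority scheme; a rough sketch of one step (partition index and device are illustrative, not the actual CoreOS tooling invocation):

    # mark freshly written partition B bootable for a single try; the
    # bootloader falls back to partition A unless the new image boots
    # and marks itself successful
    cgpt add -i 4 -P 2 -T 1 -S 0 /dev/sda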
I know that there are lots of heavyweight folks who swear by Zookeeper as both a reliable and powerful tool, and for good reason. Unfortunately the docs can be fairly inscrutable, even for experts, and it typically requires the maintenance of a separate cluster of Zookeeper nodes.
So I like that etcd is a fundamental component of CoreOS, with these features:
1. Written from scratch in Go
2. Implements the Raft protocol for node coordination
3. Has a useful command-line app (example after this list)
4. Runs on every node in a cluster
5. Enables auto-discovery
6. Allows CoreOS nodes to share state
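For instance, reading and writing shared state from the CLI is a one-liner (key names invented; 0.x-era commands):

    etcdctl set /services/web/host 10.0.0.1
    etcdctl get /services/web/host
    etcdctl ls --recursive /services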
No offense, I just don't get it: Why is 1) a feature for you? Everything else on the list kinda makes sense (I understand that this describes something I'd call a feature), but 'written from scratch' or 'in Go'?
I guess what I meant by "from scratch" is that they aren't burdened by legacy code, and aren't limited to using the Paxos algorithm.
If you look at Deis, for example, it basically outsources a lot of node management to Chef Server, which in my view creates a great deal of technical debt on day one.
You read negative connotations into 'excites'. That wasn't intended.
I was just curious, since 'from scratch' can just as well mean 'untested', although I certainly agree that it sometimes is the Right Thing. The reference to Go was another thing that threw me off, since I rarely (admittedly... sometimes) judge software projects by the language they're written in.
Thank you for the answer and some more references.
Probably the fact that you get a single binary (unlike interpreted languages), and that it isn't a mix of Go and C, which means I don't have to worry about C libraries.
Personally, my preference for Zookeeper comes from the API. To me, the ZK API and docs were far more understandable than the etcd ones. The etcd API docs appear to be a collection of examples, not reference docs. They do a poor job of explaining the possible operations and what various options will do, particularly which combinations of options are allowed.
In fact, I had to resort to running test queries against a running etcd server just to work out the proper semantics of some of the arguments.
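That testing looked something like this (v2 API on the then-default client port; key name invented):

    # write and read back a key
    curl -L http://127.0.0.1:4001/v2/keys/message -XPUT -d value="hello"
    curl -L http://127.0.0.1:4001/v2/keys/message
    # the kind of option combination I had to try by hand:
    # create only if the key doesn't already exist
    curl -L "http://127.0.0.1:4001/v2/keys/message?prevExist=false" -XPUT -d value="hi"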
The functionality to handle this is mentioned in the blog post: "standby" peer mode.
"Our upcoming release, etcd 0.4, adds a new feature, standby mode. It is an important first step in allowing etcd clusters to grow beyond just the peers that participate in consensus."
Fair enough, it just sounded like the GP was describing something inherent rather than something new, and it didn't mesh with my understanding of how etcd worked to date.
As you (perhaps automatically) expand and collapse the cluster, you'll need to make sure to communicate to all nodes what the new cluster size is. If some nodes don't know the correct quorum count, split-brain!
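To make that concrete (numbers mine): quorum is floor(N/2)+1, so a 5-node cluster needs 3 votes. Grow it to 9 nodes without telling everyone, and a partition containing 3 stale nodes can still elect a leader with the quorum they remember (3), while the other 6 nodes, knowing N=9, elect their own leader with 5+ votes: two leaders at once.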
Also, coordination services are typically critical, so it's important to isolate them from the bugs in the ad-hoc code you're writing for your web tier, a crazy query in your database, etc.
It's much easier and safer in practice to just have 3 or 5 nodes running the coordination in isolation.
Edit: more reasons -- It's easier to deploy a coordination service to 5 nodes than 500. It's easier to debug 5 nodes than 500.
I probably should have said that it "can" run on any node. Yes, currently it does run on every node, but their roadmap doesn't have the requirement that every node be actively participating in elections.
I'm sure that you have seen fleets of dedicated Zookeeper nodes. I rather like that etcd is simply a service that can run on any node, and does not require a separate role-specific fleet of servers just to do coordination. That was the point I was attempting to make.
While similar, Serf and etcd solve different problems. Etcd is strongly consistent (all nodes will see the same data, however a partition may cause the system to stop accepting writes) while Serf is eventually consistent (all nodes are not guaranteed to see the same data, however the system will always accept writes).
So something like a distributed lock is impossible to implement with Serf.
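With etcd's atomic compare-and-swap, a crude lock is a couple of requests (key and holder names invented; v2-era API):

    # acquire: succeeds only if the key doesn't exist yet; the TTL
    # releases the lock automatically if the holder crashes
    curl -L "http://127.0.0.1:4001/v2/keys/locks/job?prevExist=false" \
      -XPUT -d value=worker-1 -d ttl=30
    # release: delete only if we still hold it
    curl -L "http://127.0.0.1:4001/v2/keys/locks/job?prevValue=worker-1" -XDELETE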