The available documentation for heartbeat, corosync and pacemaker can be disappointing. I've found the best way to learn is by doing: if you can afford it, cobble together a lab environment and see how many different ways you can break a system. With pacemaker, I find this depends highly on the resources you are managing!
For storage replication, Linux has the excellent DRBD (http://www.drbd.org/) software to replicate disks at the block-device level. This is great because any kind of disk-based system can be supported: database servers, mail servers, file servers, DNS servers, etc.
For failure detection, Linux has the Linux HA Heartbeat project (http://www.linux-ha.org/wiki/Heartbeat). It detects failure at the machine level and ensures proper failover.
Within a machine, there are other tools to monitor process-level failures and propagate them to Linux HA Heartbeat.
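As an illustration, here is a minimal Python sketch of that kind of process-level check: poll a local TCP port and, after a few consecutive failures, ask Heartbeat to push resources to the standby node. The port, the hb_standby path (it varies by distro) and the thresholds are all assumptions you would adapt to your own setup.

    #!/usr/bin/env python3
    # Hypothetical process-level watchdog that escalates to Heartbeat.
    import socket
    import subprocess
    import time

    SERVICE_ADDR = ("127.0.0.1", 5432)              # service to watch (placeholder)
    HB_STANDBY = "/usr/share/heartbeat/hb_standby"  # path varies by distro
    CHECK_INTERVAL = 5                              # seconds between checks
    MAX_FAILURES = 3                                # consecutive failures before failing over

    def service_alive():
        """Return True if we can open a TCP connection to the service."""
        try:
            with socket.create_connection(SERVICE_ADDR, timeout=2):
                return True
        except OSError:
            return False

    failures = 0
    while True:
        failures = 0 if service_alive() else failures + 1
        if failures >= MAX_FAILURES:
            # Ask Heartbeat to hand our resources over to the standby node.
            subprocess.call([HB_STANDBY])
            break
        time.sleep(CHECK_INTERVAL)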
BTW, STONITH ("Shoot The Other Node In The Head") is a super simple way to avoid the partition (split-brain) problem.
In a two-machine cluster, you configure two network interfaces on each machine. One interface has the real IP of the machine, serving as the private IP. The other interface has the virtual IP, serving as the public IP that all clients connect to. Configure the second machine the same way, with the same virtual IP, but leave that interface disabled.
When Heartbeat detects a failure on the primary machine, it promotes the standby machine to primary and enables its virtual IP interface. The standby then broadcasts a gratuitous ARP to all machines in the subnet to claim the virtual IP address as its own.
It's a pretty simple process.
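If it helps, here is a minimal Python sketch of that takeover step, assuming the iproute2 ip command and the iputils arping (whose -U flag sends unsolicited/gratuitous ARP; other arping variants use different flags). The interface name and addresses are placeholders.

    #!/usr/bin/env python3
    # Hypothetical VIP takeover: attach the virtual IP to the public interface
    # and broadcast gratuitous ARP so the subnet learns the new owner.
    import subprocess

    VIP = "192.168.1.100/24"   # virtual IP shared by the pair (placeholder)
    IFACE = "eth1"             # interface that carries the public/virtual IP

    def take_over_vip():
        # Bring the virtual IP up on this box.
        subprocess.check_call(["ip", "addr", "add", VIP, "dev", IFACE])
        # Announce the move so clients and switches update their ARP caches.
        subprocess.check_call(
            ["arping", "-U", "-I", IFACE, "-c", "3", VIP.split("/")[0]])

    if __name__ == "__main__":
        take_over_vip()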
Note that ucarp and most (all?) virtual IP tools rely on being able to send gratuitous/spoofed ARP to move the IP, which puts restrictions on where the servers can live on the network, switch configuration, etc.
edit: If what you're using it for is HTTP traffic, HAproxy seems to be the most mature tool out there for HTTP load balancing and failover.
HAProxy is a reverse proxy: it can do load balancing, but on its own it doesn't give you an HA cluster, since HAProxy itself becomes the single point of failure in your architecture. You still need something like wackamole, vippy or ucarp to handle the switch from a failed machine to a standby box.
For high-availability HAProxy I've found ECMP to be by far the simplest way to achieve very reliable redundancy, with a side benefit of more or less infinite horizontal scaling (depending on what routers you're talking to). I've served hundreds of gigabits with this model, and it works well. You can even scale it out at a global level by using anycast.
It's pretty simple: run bgpd on your HAProxy boxes, talking to your router. Each HAProxy box advertises your VIP, and your router load-balances across them via ECMP. Should a HAProxy machine die, its BGP announcement gets withdrawn and traffic flows to the remaining proxy servers still advertising the VIP.
The only thing left to do is set up some sort of service monitoring that can automatically take down bgpd should haproxy die or otherwise misbehave on an individual machine. Add a couple of checks to ensure there is always a "path of last resort" should you have a bug in your app or monitoring code, and this proves to be very resilient, scalable, and something nearly anyone can troubleshoot in a very short amount of time. It also works well in a cloud-type on-demand/devops model, since it's extremely easy to simply spin up additional haproxy machines and have them automatically announce their configuration via bgpd.
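For what it's worth, here is a rough Python sketch of that monitoring piece, assuming HAProxy exposes a local health endpoint (e.g. via monitor-uri) and that the BGP daemon runs as a systemd unit named "bgpd" (the unit name differs between Quagga, FRR and BIRD setups); the port and thresholds are placeholders.

    #!/usr/bin/env python3
    # Hypothetical check: if HAProxy looks dead, stop bgpd so the VIP
    # announcement is withdrawn and the router's ECMP group drops this box.
    import subprocess
    import time
    import urllib.request

    HEALTH_URL = "http://127.0.0.1:8080/health"   # hypothetical monitor-uri
    CHECK_INTERVAL = 2                            # seconds between checks
    MAX_FAILURES = 3                              # consecutive failures before withdrawing

    def haproxy_healthy():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    failures = 0
    while True:
        failures = 0 if haproxy_healthy() else failures + 1
        if failures >= MAX_FAILURES:
            # Stopping bgpd withdraws our route; traffic shifts to the
            # remaining proxies still advertising the VIP.
            subprocess.call(["systemctl", "stop", "bgpd"])
            break
        time.sleep(CHECK_INTERVAL)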