We auto-scale EC2, and every so often a newly launched server couldn't connect to memcache (ElastiCache). Note that when you migrate over to VPC you have to migrate everything: launch new ElastiCache servers in the VPC, new EC2 servers, new RDS servers, etc.
Back to the bug. I'd ssh into the EC2 server, and when I tried to telnet to memcache, it wouldn't connect. I'd terminate the EC2 server, and a new server would come up and connect fine. I posted in the AWS forums and got zero responses. We then bought AWS support and I submitted a ticket.
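If you'd rather script the telnet check than do it by hand, something like this rough Python sketch could work. The function name `can_connect` is made up for illustration, and I'm assuming memcache is listening on its usual port 11211; swap in your own ElastiCache endpoint.

```python
import socket


def can_connect(host, port=11211, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout.

    This only checks that the TCP handshake succeeds (what telnet would show);
    it does not speak the memcache protocol.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Running this from a boot script on each new instance would at least flag a bad server before it starts taking traffic.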
The problem: I had launched my ElastiCache servers in the same subnet as my EC2 servers. Apparently the ElastiCache servers by default remember servers in the same subnet by MAC address. Since we were cycling EC2 servers, eventually we'd get one with the same MAC address but a new internal IP address. I'm no networking guy, but apparently this caused a routing problem.
Solution: create a new subnet and launch all the ElastiCache servers in that subnet. I did that, and it fixed the problem. The AWS support rep said that if the ElastiCache servers are launched in their own subnet, it forces them to go by IP address instead of MAC address.
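For what it's worth, here is roughly what that fix might look like with boto3. This is a sketch, not what I actually ran: the function name, the default node type, and every argument value are placeholders, and it leaves out security groups and route tables. The `ec2` and `elasticache` arguments are assumed to be clients from `boto3.client("ec2")` and `boto3.client("elasticache")`.

```python
def launch_cache_in_own_subnet(ec2, elasticache, vpc_id, cidr_block,
                               group_name, cluster_id,
                               node_type="cache.t3.micro", num_nodes=1):
    """Create a dedicated subnet and launch a memcached cluster inside it.

    All names and defaults here are illustrative placeholders.
    """
    # 1. A new subnet used only for ElastiCache, separate from the EC2 subnet.
    subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock=cidr_block)
    subnet_id = subnet["Subnet"]["SubnetId"]

    # 2. A cache subnet group that contains only that subnet.
    elasticache.create_cache_subnet_group(
        CacheSubnetGroupName=group_name,
        CacheSubnetGroupDescription="Dedicated subnet for ElastiCache",
        SubnetIds=[subnet_id],
    )

    # 3. A new memcached cluster launched into the subnet group.
    elasticache.create_cache_cluster(
        CacheClusterId=cluster_id,
        Engine="memcached",
        CacheNodeType=node_type,
        NumCacheNodes=num_nodes,
        CacheSubnetGroupName=group_name,
    )
    return subnet_id
```

Passing the clients in as arguments keeps the function easy to dry-run with stand-in objects before pointing it at a real account.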
Anyway, hope this helps someone out ;)
This sounds like an EC2 bug, not an ElastiCache bug. EC2 uses "locally administered addresses" for MACs (as it should), but the administrator assigning MAC addresses is responsible for ensuring that addresses are not reused within a collision domain.
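To make "locally administered" concrete: a MAC address is locally administered when the second-least-significant bit of its first octet is set. Here's a small Python sketch of that check; the function name is invented for illustration and the sample addresses are made up, not taken from real instances.

```python
def is_locally_administered(mac: str) -> bool:
    """Return True if the MAC has the locally-administered (U/L) bit set.

    Accepts colon- or hyphen-separated addresses such as "02:1a:2b:3c:4d:5e".
    """
    first_octet = int(mac.replace("-", ":").split(":")[0], 16)
    # Bit 0x02 of the first octet marks a locally administered address.
    return bool(first_octet & 0x02)


# Made-up example addresses:
print(is_locally_administered("02:1a:2b:3c:4d:5e"))  # True
print(is_locally_administered("00:1a:2b:3c:4d:5e"))  # False
```

You can compare against what `ip link` reports on your own instances to see which kind of address they carry.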