Just at the IP level they have to deal with (at the edges and across substantial WANs) BGP - a notoriously ugly and fragile protocol. Internal routing protocols such as OSPF are equally ugly and prone to breakage. Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their LAN, out their backup T1. Other issues are BGP flapping, resulting in scary percentages of lost traffic. This doesn't even cover trickier stuff like routing loops...
Other considerations in big routers are things like ASN identifiers and peering points. Considerations like traffic cost, SLAs and QoS all go into traffic balancing on such routers. MPLS clouds complicate (and oddly enough simplify) these things as well.
There are also important issues like Anycast, CDN and NAT that largely rely on router tricks and add to the complication.
Finally, on top of all this, there is the security concern - you can't just throw a firewall in front of it, as many firewall issues are really routing issues and therefore must also be handled in the router.
All these layers interact and affect each other. Any given machine can only handle so much traffic and so many decisions, so something that is drawn as a single router on a networking chart may actually be several boxes cascaded to handle the complexity.
Oh yeah, and switches are getting progressively smarter with other rules and weirdnesses that provide horribly leaky abstractions that shouldn't matter to the upstream router, but turn out to add issues to the configuration and overall complexity on top of it all.
> Many are the tales of some small company misconfiguring their edge routers slightly (say a 1 char typo) and having the entire internet route through their T1, across their LAN, out their backup T1.
This is what route filters are for. If you peer with someone and they advertise 0.0.0.0/0 or something equally ridiculous, and you accept it as a valid route, then you deserve to fail (and deserve a firm stare if you then proceed to advertise it to other peers).
A similar fail on the part of Telstra (http://bgpmon.net/blog/?p=554) was to blame for much of Australia dropping off the map earlier this year.
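For anyone who hasn't seen one, the idea behind an inbound route filter can be sketched in a few lines of Python. This is only an illustration of the logic, not how any real router implements it: the length limits, the function name `accept_route` and the `allowed` list are all made up for the example, and it only considers IPv4.

```python
import ipaddress

# Hypothetical policy values, chosen purely for illustration.
MIN_PREFIX_LEN = 8    # anything broader (including 0.0.0.0/0) is rejected
MAX_PREFIX_LEN = 24   # anything more specific than a /24 is rejected


def accept_route(prefix, allowed):
    """Return True if an advertised IPv4 prefix passes the inbound filter.

    `allowed` is the list of networks this peer is expected to advertise
    (an assumption of this sketch; real filters are usually built from
    registry data or agreed with the peer).
    """
    net = ipaddress.ip_network(prefix)
    # Reject absurdly broad or overly specific advertisements first.
    if not (MIN_PREFIX_LEN <= net.prefixlen <= MAX_PREFIX_LEN):
        return False
    # Only accept prefixes that fall inside what the peer should own.
    return any(net.subnet_of(a) for a in allowed)
```

With `allowed` set to `[ipaddress.ip_network("203.0.113.0/24")]`, a peer advertising `203.0.113.0/24` is accepted, while `0.0.0.0/0` or a prefix outside that block is dropped before it ever reaches the routing table or gets re-advertised.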
> Just at the IP level they have to deal with (at the edges and across substantial WANs) BGP - a notoriously ugly and fragile protocol.
I take offence at this. 80% of BGP-related issues are due to misconfiguration by a given party, 19% are due to bad or missing route filters and the other 1% are due to bugs in router software. The actual implementation of BGP v4, originally designed back in the early 90s, isn't completely without its issues and behavioural quirks (I'm looking at you, route flaps), but the theory/algorithm behind it is a work of art, and it has coped amazingly with explosive growth, growth that's only going to increase with IPv6.
Without it, there would be no HN.
Doing the wrong thing to router table(s) is the network equivalent of "sudo rm -rf /".
Excluding static routes (which are then usually advertised to other peers), routing tables are dynamically built and only exist in non-persistent memory.
Having up-to-date backups of router configuration is another matter entirely.
At a first level, routers will cache information about which machines are on which port of the router. This enables them to not forward a packet to every port on the router (reducing network load). This is normally done using MAC addresses.
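The port caching described above (usually called MAC learning, and most often associated with switches) can be sketched roughly like this. The class and method names are invented for the example, and real devices also age entries out, which is left out here.

```python
class LearningTable:
    """Toy model of a MAC learning table (names are hypothetical)."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_to_port = {}  # learned source MAC -> port it was last seen on

    def handle_frame(self, src_mac, dst_mac, in_port):
        """Return the list of ports the frame should be sent out of."""
        # Learn (or refresh) which port the sender lives on.
        self.mac_to_port[src_mac] = in_port

        out_port = self.mac_to_port.get(dst_mac)
        if out_port is None:
            # Unknown destination: flood to every port except the ingress one.
            return sorted(self.ports - {in_port})
        if out_port == in_port:
            # Destination is on the same segment as the sender: nothing to do.
            return []
        return [out_port]
```

The first frame to an unknown address is flooded; once the destination has sent something itself, later frames to it go out of a single port.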
On more expensive routers, the router can understand IGMP, or multicast and route based on multicast joins. This enables an optimization on simple broadcasts because not every machine on the network needs to see these packets, but multiple machines might want to.
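As a rough illustration of forwarding based on multicast joins, here is a minimal sketch. It only models the membership bookkeeping; real IGMP also involves periodic queries and timers, and the class and method names here are assumptions made for the example.

```python
from collections import defaultdict


class MulticastTable:
    """Toy model of per-group port membership (names are hypothetical)."""

    def __init__(self):
        self.members = defaultdict(set)  # group address -> ports that joined

    def join(self, group, port):
        self.members[group].add(port)

    def leave(self, group, port):
        self.members[group].discard(port)
        if not self.members[group]:
            # Drop empty groups so the table does not grow forever.
            del self.members[group]

    def out_ports(self, group, in_port):
        """Ports a packet for `group` is copied to, excluding the ingress port."""
        return sorted(self.members.get(group, set()) - {in_port})
```

A packet for a group nobody has joined goes nowhere, and a packet for a joined group goes only to the ports that asked for it, rather than to every port as a plain broadcast would.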
It gets really complex really fast, and even the slightest lapse in attention to extreme detail can be disastrous and a nightmare to troubleshoot.
For instance, I once had a router reboot for some unknown reason, and the firmware/config wasn't properly flashed, so it reverted to the last point release of the firmware. That was enough to cause the failover heartbeat to be constantly triggered, and the master/slave routers just kept failing over to each other and corrupted all the ARP caches. Makes for a lot of fun when things work and then stop working in an inconsistent manner.
It is a little sad that they didn't try rebooting the routers for so many hours.
This assumes they're being literal and "corrupt" means "a bit randomly flipped" instead of using "corrupt" in the figurative sense of "operator error".