
Introduction to Architecting Systems for Scale - fogus
http://lethain.com/introduction-to-architecting-systems-for-scale/
======
3amOpsGuy
Good read. I don't think you were controversial :-)

Spotted a wee typo about 1/2 way down:

>> LRU works by evicting less commonly used data in preference of more frequently used data

For this question:

>> Does anyone know of recognized tools which solve this problem?

BMC's Control-M product manages this fairly easily, although in my experience
it is easy to let the workflow become unwieldy with that product. AutoSys
fares a little better for this use case.

Open-source-wise, I guess you could use PBS or something of that ilk to
replicate it. I think, though, that an ideal architecture for this problem
isn't something that's currently available.

I think a hot-hot message queue with deduplication would be a better approach.
You can then afford to have multiple hosts submit an appropriately named job,
and the first node on the other side of the queue to successfully lease the
message wins the right to run the task contained within. If it fails to
complete, the next node leases the task.

It would require some consideration about ensuring integrity of the message
and authentication requirements for publishers.
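The leasing scheme described above can be sketched in a few lines. This is a hypothetical in-process illustration, not Control-M, AutoSys, or any real queue's API: the class name, the lease timeout, and the job names are all invented for the example. Duplicate submissions of the same named job are dropped, a worker holds an exclusive lease for a fixed window, and an unacknowledged lease expires so another node can pick the task up.

```python
import threading
import time

class DedupTaskQueue:
    """Hypothetical sketch: a deduplicating task queue with leasing.

    Multiple hosts may submit a job under the same name; only the first
    submission is kept. A worker leases a task for a fixed duration and
    must acknowledge completion before the lease expires, otherwise the
    task becomes leasable again for the next node.
    """

    def __init__(self, lease_seconds=30):
        self._lock = threading.Lock()
        self._tasks = {}   # job name -> payload
        self._leases = {}  # job name -> lease expiry (monotonic time)
        self._lease_seconds = lease_seconds

    def submit(self, name, payload):
        """Submit a named job; duplicates from other hosts are dropped."""
        with self._lock:
            if name in self._tasks:
                return False  # deduplicated
            self._tasks[name] = payload
            return True

    def lease(self):
        """Lease the first task with no active lease, or return None."""
        now = time.monotonic()
        with self._lock:
            for name, payload in self._tasks.items():
                if self._leases.get(name, 0) <= now:
                    self._leases[name] = now + self._lease_seconds
                    return name, payload
        return None

    def ack(self, name):
        """Mark a leased task complete so no other node re-runs it."""
        with self._lock:
            self._tasks.pop(name, None)
            self._leases.pop(name, None)
```

A real deployment would of course need the queue itself to be replicated (the "hot-hot" part), plus the message-integrity and publisher-authentication pieces mentioned above.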

------
ChuckMcM
Nicely summarized on the network layer, next you'll want to expand the
'database' box into its components and a storage layer and its components.

There is also an interesting layer of networking services involving
routability and validation (certificate checking, etc.), and then there is
third-party API scale, so sometimes you're generating traffic back out to
things other than a CDN (like Twitter or Facebook or some Google thing).

Part 4 should look at it from the data center side, which is that these
things are breaking all the time, and you're building scalable repair systems
that give 100% uptime on unreliable hardware.

It goes on and on and on ...

------
kzahel
Thank you for the article. It was well written and an enjoyable read!

