n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour, to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart to y'all for the same reason he imparted it to me: it costs you nothing and will save you grief some day far in the future.
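In practice that's nothing fancier than a crontab entry like this (the minute was rolled once and then left alone; the script path is made up for the example):

  # 37 was picked at random once, then hardcoded: predictable,
  # but safely off the :00 stampede.
  37 3 * * * /usr/local/bin/tarsnap-backup.sh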
"AccuracySec=" in *.timer files lets you specify the amount of slack systemd has in firing timers. To quote the documentation "Within this time window, the expiry time will be placed at a host-specific, randomized but stable position that is synchronized between all local timer units."
You may still want to randomize timers locally on a host too, but the above makes automated deployment of timers that affect network services very convenient.
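A minimal sketch of such a timer (the unit name and schedule are invented for the example; it would pair with a hypothetical backup.service):

  # backup.timer
  [Unit]
  Description=Daily Tarsnap backup

  [Timer]
  OnCalendar=daily
  # Within this one-hour window, systemd picks a host-specific,
  # randomized but stable firing time, shared by all local timer units.
  AccuracySec=1h

  [Install]
  WantedBy=timers.target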
Yes, that sounds about right. I had maybe half a dozen people write to me who had noticed performance problems, and after the initial "backups failed because the server hit its connection limit" issue, it was people whose backups were already very long-running -- if your daily backups normally take 20 hours to complete, a 40% slowdown is painful.
FWIW, I live in Australia (so an 'off-peak' timezone), and schedule my cronjob on an odd minute offset, so it may not have been an issue for me anyway!
echo $((RANDOM % 60))
This might be platform-dependent, though: bash documents $RANDOM as spanning 0-32767, but $RANDOM itself isn't POSIX, so it's difficult to make this work in every shell.
sleep $(( 0x$(xxd -l2 -p /dev/random) * 3600 / 65536 ))
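(Unpacking that one-liner: xxd -l2 -p reads two bytes and prints them as four hex digits, the 0x prefix makes the shell's arithmetic treat them as a number in 0-65535, and the 3600/65536 scaling maps that to a sleep of 0-3599 seconds. It should work in plain sh, at the cost of depending on xxd; you may also prefer /dev/urandom, which never blocks.)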
Add some metadata telling Tarsnap to expect a once-a-day/week/month backup from this machine, and to send you an email if one doesn't arrive?
Until the day when Colin considers it in scope for Tarsnap, I recommend Dead Man's Snitch for this purpose. I literally spend more on DMS to monitor Tarsnap than I spend on Tarsnap. No, I don't think that is just, either.
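The integration is about as simple as monitoring gets -- a sketch, with a placeholder snitch URL and a made-up backup script:

  # Ping the snitch only when the backup actually succeeds;
  # no ping within the check's interval means DMS emails you.
  # https://nosnch.it/EXAMPLE stands in for your real snitch URL.
  0 3 * * * /usr/local/bin/tarsnap-backup.sh && curl -fsS https://nosnch.it/EXAMPLE >/dev/null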
And the discussion on HN: https://news.ycombinator.com/item?id=7523953
(This may be slightly overbuilt, but I felt it justified to get peace of mind, given AR's fair degree of importance to customers/myself and the enterprise-y customer base. In particular, I would not have been happy with any monitoring solution which would fail if I lost network connectivity at the data center.)
$15 a month is far below my care floor for making sure that my backups are working and that I do not get sued into bits.
# Generate a uniformly distributed unique number to sleep.
require 'digest/md5'

if node['chef_client']['splay'].to_i > 0
  # Hash the FQDN so each host gets a different, but stable, offset.
  checksum = Digest::MD5.hexdigest(node['fqdn'] || 'unknown-hostname')
  sleep_time = checksum.to_s.hex % node['chef_client']['splay'].to_i
else
  sleep_time = nil
end
This is random enough across hosts that you won't kill the server, and deterministic per host, so the resource isn't changing on every Chef run.
mental note: think harder, next time.