
In case any other customer is wondering "Wait, I didn't hear anything from my monitoring about that and I'm retroactively worried. How worried should I be?" like I was: I just pulled our logs and reconstructed them, and they show that over the last ~30 days the worst-case performance of our daily backup (~150 MB per day delta, ~45 GB total post-deduplication) was about 40% longer than our typical case. This didn't trip our monitoring at the time because the backups all completed successfully.

n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have had less performance impact than many customers. I have an old habit of "Schedule all cron jobs to start predictably but at a random offset from the hour to avoid stampeding any previously undiscovered SPOFs." That's one of the Old Wizened Graybeard habits that I picked up from one of the senior engineers at my last real job, which I impart onto y'all for the same reason he imparted it onto me: it costs you nothing and will save you grief some day far in the future.
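A minimal sketch of what that habit looks like in a crontab (the time and script path here are placeholders, not from the comment):

    # Daily backup at a fixed but off-the-hour offset instead of the obvious 03:00
    17 3 * * * /usr/local/bin/nightly-backup.sh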




Explicit support for randomizing timers across multiple hosts is a really nice feature of the timers provided by systemd:

"AccuracySec=" in *.timer files lets you specify the amount of slack systemd has in firing timers. To quote the documentation "Within this time window, the expiry time will be placed at a host-specific, randomized but stable position that is synchronized between all local timer units."

You may still want to randomize timers locally on a host too, but the above makes automated deployment of timers that affect network services very convenient.


the worst-case performance of our daily backup (~150 MB per day delta, ~45 GB total post-deduplication) was about 40% longer than our typical case

Yes, that sounds about right. I had maybe half a dozen people write to me who had noticed performance problems, and after the initial "backups failed because the server hit its connection limit" issue, it was people whose backups were already very long-running -- if your daily backups normally take 20 hours to complete, a 40% slowdown is painful.


I run my backups overnight and get a status email each morning, and I didn't even realise there were performance issues until now. As you said, unless you run your backups multiple times per day, or have long-running backups, it may not have had a lot of impact.

FWIW, I live in Australia (so an 'off-peak' timezone), and schedule my cronjob on an odd minute offset, so it may not have been an issue for me anyway!


Hear hear on said Old Wizened Graybeard habit. The amount of pain inflicted by twenty jobs all starting up at :00 (or even :30, :45, etc.) when they could easily run at :04 or :17 can be huge. Anecdotally, I once "lost" a sandbox server to a ton of developer sandbox jobs starting at :00 and not completing before the next batch started.


Funny part to that: I was on a project with multiple teams and multiple crontabs. Each team took that advice to heart for some jobs. Sadly, we had too many Hitchhiker fans, and :42 became a bit too common.


Use the following shell command to decide when to run cron jobs.

    echo $((RANDOM % 60))
It's not a CSPRNG, but good enough for this kind of load balancing!


Or schedule your cron job for :00, but add "sleep `jot -r 1 0 3600` &&" to the start of the command. (jot is a BSDism, but I assume you can do the same with GNU seq.)
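On the GNU side, one rough equivalent uses shuf from coreutils rather than seq (the backup command here is a placeholder):

    sleep $(shuf -i 0-3600 -n 1) && /path/to/backup-command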


This is a pain when deciphering a series of events later, though, because you don't know when a particular job was supposed to start. I'd prefer the delay to be stable on a per-host basis.


Don't use that for hourly jobs, though - things are liable to break when you randomly run a command at, say, 12:59 and 13:00.


Right, I usually do that for my daily jobs.


sleep $[RANDOM/3600] works everywhere without requiring jot/seq etc. on BSD/Mac/Linux.


That will be a number between 0 and 9 ($RANDOM only goes to 32767); sleep $[RANDOM/10] would be better. :)

This might be platform dependent, though; I can't find any standard RAND_MAX in bash, so it's difficult to make this work everywhere.
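If the goal is a delay covering the full hour regardless of the divisor, one sketch (assuming bash's documented 0-32767 range for $RANDOM) is to scale rather than divide:

    sleep $(( RANDOM * 3600 / 32768 ))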


This works in (da)sh (tweak 2 and 65536 if needed):

    sleep $(( 0x$(xxd -l2 -p /dev/random) * 3600 / 65536 ))


s/\//%/ I assume?


Oops

  s/\//\\%/
yeah.

  sleep $[RANDOM\%3600]


We just went with a single group text file listing all the jobs and which ones could be spread out. It saves the programming and gives the sysadmins/DBAs an idea of what goes when.


Don't run on :17 and :39. Those are mine. Thanks!


One way to think about your fear: shouldn't that just be a tarsnap feature?

Add some metadata telling tarsnap to expect a once-a-day/week/month backup from a given machine, and have it email you if one doesn't arrive?


*whistles*

Until the day when Colin considers it in-scope for Tarsnap, I recommend Deadman's Snitch for this purpose. I literally spend more on DMS to monitor Tarsnap than I spend on Tarsnap. No, I don't think that is just, either.
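The usual dead-man's-switch pattern (the archive name and check-in URL below are placeholders) is to ping the monitoring endpoint only after the backup command succeeds:

    tarsnap -c -f "backup-$(date +%Y%m%d)" /home && curl -fsS https://your-snitch-checkin-url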


For those interested in patio11's thoughts on how he would run tarsnap: http://www.kalzumeus.com/2014/04/03/fantasy-tarsnap/

And the discussion on HN: https://news.ycombinator.com/item?id=7523953


Did Colin ever reply to that? I've always wondered what his response was.




Don't you have some other servers running other services? So you must already have some monitoring and alerting system like Nagios, to which you can add one more little "passive check" that does the same thing, for no incremental cost?


I have roughly fourish separate monitoring systems for Appointment Reminder. DMS is the one which is least tied to me, so I use it for Tarsnap (the most critical thing about AR that can fail "quietly") and as the fourthish line of defense for the core AR functionality.

(This may be slightly overbuilt, but I felt it justified to get peace of mind, given AR's fair degree of importance to customers/myself and the enterprise-y customer base. In particular, I would not have been happy with any monitoring solution which would fail if I lost network connectivity at the data center.)

$15 a month is far below my care floor for making sure that my backups are working and that I do not get sued into bits.


Touché :)


I'll second it (https://deadmanssnitch.com/). It's such a useful tool; it's saved my bacon more than once.


We actually have our Chef rdiff-backup cookbook randomly distribute jobs across buckets of time using a hash function of the hostname.
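For anyone wanting the same stable-per-host offset outside of Chef, a rough shell sketch (cksum is just a convenient stand-in for a proper hash) looks like:

    # Derive a stable minute offset (0-59) from the hostname
    OFFSET=$(( $(hostname | cksum | cut -d' ' -f1) % 60 ))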


I have to know: Why a hash function of the hostname?


The chef-client cookbook does a similar thing in its cron recipe:

  # Generate a uniformly distributed unique number to sleep.
  if node['chef_client']['splay'].to_i > 0
    checksum   = Digest::MD5.hexdigest(node['fqdn'] || 'unknown-hostname')
    sleep_time = checksum.to_s.hex % node['chef_client']['splay'].to_i
  else
    sleep_time = nil
  end
https://github.com/opscode-cookbooks/chef-client/blob/master...

This is random enough that you won't kill the server, and deterministic enough that the resource isn't changing on every Chef run.


No hash collisions; hostnames (in almost all practical environments) are never identical.


It's more or less random, but stable.


I ran into this recently, backing up munin data to S3. I ran it at a time offset from the hour to avoid those 'on-the-hour' rushes, but I was getting problems with the copy. It took me a moment to realise I was doing it on a 5-minute boundary, and munin fires on a 5-minute boundary - the data was being updated as I was copying it...

mental note: think harder, next time.



