
Tarsnap performance issues in late March, most of April - pndmnm
http://mail.tarsnap.com/tarsnap-announce/msg00031.html
======
Someone
Good description, but I'm missing lesson learned #0: Do not wait too long
before informing your users, even if only to tell them "we know about it and
are working on it".

------
Osiris
For those that want to run a similar service using their own systems, I found
that Attic [1] is a great open source backup tool that works in a very similar
way, including deduplication and compression.

I back up some VPS servers to my NAS at home using attic over an SSH tunnel.
Incremental backups are quite small and it's easy to automate with a simple
cron job.

[1] [https://attic-backup.org/](https://attic-backup.org/)
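
For anyone wanting to try something similar, here's a minimal sketch of that setup as a crontab entry. The host, repository path, and backup paths are made up, and it assumes a repository already initialized with `attic init` plus key-based SSH access:

```shell
# Nightly at 02:17 (note the off-hour minute). attic deduplicates
# against earlier archives, so only changed chunks cross the wire.
# A bare % would end the cron command, hence the backslashes.
17 2 * * * attic create --stats user@nas.home:backups.attic::"nightly-$(date +\%Y-\%m-\%d)" /etc /home
```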

~~~
middleclick
How does this compare to Duplicity?

~~~
scott_karana
It uses git-style addressable blob storage, so you don't have to worry about
deltas, because there _aren't_ any.

It's also got more efficient deduplication, because it doesn't use rsync's
naïve algorithm.

The downsides: it requires the agent to be remotely installed (a la rsync: no
"dumb" backends), and it supports fewer storage backends to boot.

YMMV :-)

------
appsonify
what the fu....Colin Percival used to be my cello teacher 12 years ago....and
he is running tarsnap. My mind is blown.

~~~
cperciva
Not me -- I'm a violinist, and I've never taught anyone violin either.

Maybe you're thinking of my brother (Graham)? He was teaching cello around
that time period, I think.

~~~
appsonify
Oh wow. Yes! Now I remember it was Graham. I was his student around then. Saw
your photo on twitter and looked exactly like him! Hahaha. How is Graham?

My mind is just completely blown right now.

~~~
cperciva
Graham is in Japan right now, but returning to Canada soon; he'll be joining
me in the West Coast Symphony for our June concert.

I don't feel that I should really be talking too much about my family in a
public forum, but if you'd like to send me an email I can forward it to him.

------
cperciva
I suppose I should have known that this would end up at the top of Hacker
News...

~~~
JacobAldridge
It's the picodollars - tarsnap was the second business I fell in love with on
HN (the late Kiko was #1) purely because of the awesome vibe I felt emanating
from your enterprise (which I'm assuming is a reflection on you as well).

Years later, you've also become a _cause celebre_ for holding true to a clear
business and lifestyle vision (again, perceived at distance), in spite of the
recommendations and 'support' provided by Patrick and others, including
myself. Keep being true, and I suspect the community will keep learning from
you Colin.

~~~
cperciva
_It's the picodollars_

Hey Thomas, are you listening here?

In all seriousness, the picodollars do an excellent job of attracting exactly
the sort of customers I want... and turning away the customers I _don't_
want. They were originally part joke and part a way to avoid arguments with
customers who don't understand that 1 GB < 1 GiB, but now it's way more than
that.

 _in spite of the recommendations and 'support' provided by Patrick and
others_

Don't be too harsh on Patrick. His vision for Tarsnap is not my vision for
Tarsnap, but he has helped me to orient myself: The projection of "business"
onto the subspace "geek" doesn't look very much like "business", but it's not
the same as "kid right out of university who has never had a real job" either,
and that's what you would see if I hadn't had advice (from Patrick, Thomas,
various YC people, and the rest of HN).

Advice can be very valuable even if you don't follow it to the letter.

~~~
JacobAldridge
Don't get me wrong - I _probably_ still agree with Patrick and Thomas.

(Previous:
[https://news.ycombinator.com/item?id=7731268](https://news.ycombinator.com/item?id=7731268))

[Edit: And I think the competing theories are an excellent lesson for that
"kid right out of university who has never had a real job".]

------
k1w1
As an AWS user this type of thing gives me cause for concern:

 _At 2015-04-01 00:00 UTC, the Amazon EC2 "provisioned I/O" volume on which
most of this metadata was stored suddenly changed from an average latency of
1.2 ms per request to an average latency of 2.2 ms per request. I have no idea
why this happened -- indeed, I was so surprised by it that I didn't believe
Amazon's monitoring systems at first -- but this immediately resulted in the
service being I/O limited._

A sudden doubling of latency can have dire consequences on any system. Knowing
that such unexpected changes are possible makes it hard to trust your
environment, even if it is running fine today.

~~~
cperciva
Indeed, I didn't know such a change was possible -- that EBS volume went for
years with consistent low latency before it suddenly slowed down.

~~~
jeffbarr
You could have contacted AWS support or emailed me. Either way, we would have
investigated.

~~~
cperciva
It wasn't missing its guaranteed # of I/Os per second, so I figured the
slowdown was just "one of those things" and not an out-of-spec issue. Happy to
send you the volume ID if you think someone would want to investigate (and
still has data from the start of April) though.

~~~
jeffbarr
Yes, please do.

------
mtsmith85
This line: _I would have sent out an email to the mailing lists earlier; but
since at each point I thought I was "one change away" from fixing the
problems, I kept on delaying said email until it was clear that the problems
were finally fixed_ is such a common situation for most people, but I tend
to see it with engineers especially. I find I struggle with it an incredible
amount. In some ways, I guess it seems healthy or reassuring that incredibly
smart people like Colin Percival suffer from similar challenges around fully
understanding the scope of the problem and the solution.

All that being said, I really respect the detailed response from a technical
perspective as well as owning up to (and the decisions that went into) a spell
of downgraded performance.

Later edit because I don't want to spam the comments: I'd love some context
(maybe from cperciva himself?) around the performance enhancement of
integrating new Intel AESNI instructions. This is well beyond my depth, and
while Colin mentions that it didn't necessarily increase performance, I'm
wondering if the hope is that it would in the long term? Or were there other
benefits to such an integration?

~~~
cperciva
_I would have sent out an email to the mailing lists earlier; but since at
each point I thought I was "one change away" from fixing the problems, I kept
on delaying said email until it was clear that the problems were finally
fixed_

This ties in to the last lesson I mentioned at the bottom:

 _5\. When performance drops, it's not always due to a single problem;
sometimes there are multiple interacting bottlenecks._

Every time I identified a problem, I was correct that it was a problem -- my
failing was in not realizing that there were several things going on at once.

~~~
jcrites
> Every time I identified a problem, I was correct that it was a problem -- my
> failing was in not realizing that there were several things going on at
> once.

Very common! One thing that's been helpful for us is establishing predefined
system performance thresholds that, if exceeded, initiate the chain of events
that will lead to customer communication. "If X% of requests are failing, then
we had better advertise that the system is degraded." Discussing and setting
these thresholds in advance and the expectation that they'll result in
communication helps drive the right outcome. It's not perfect, because one is
always tempted to make a judgment call in the circumstance, which is
vulnerable to the same effect, but it's a good start.
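
A minimal sketch of what such a predefined threshold might look like in practice, assuming failures show up as lines containing "ERROR" in a request log; the log format, the 5% cutoff, and the alert action are all illustrative assumptions:

```shell
#!/bin/sh
# Alert when at least THRESHOLD percent of logged requests failed.
# The decision is made by a number agreed on in advance, not by a
# judgment call made in the heat of the incident.
THRESHOLD=5

check_degraded() {
    log="$1"
    total=$(wc -l < "$log")
    failed=$(grep -c 'ERROR' "$log")
    [ "$total" -eq 0 ] && return 1          # empty log: nothing to report
    pct=$((failed * 100 / total))
    if [ "$pct" -ge "$THRESHOLD" ]; then
        # In real life: page the on-call / post a status update here.
        echo "degraded: ${pct}% of requests failing"
    fi
}
```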

Thanks for sharing!

------
btmorex
Why are you reinventing a scheduler when the OS (at least Linux) already
provides a good one?

~~~
cperciva
I'm talking about scheduling tasks within a single process.

~~~
btmorex
Threads

~~~
cperciva
Too much overhead. Also, concurrent systems are actively malicious.

~~~
btmorex
I don't believe that you have too many active connections for threading to
work. Passive connections can be handled by a single or small number of
threads. Modern Linux on modern hardware has no problem with many thousands of
threads and the overhead is minimal in $$$ compared to the time you wasted
debugging a scheduling problem.

As for concurrent systems being harmful, you just have to design your program
with threading in mind. Minimize shared state and be very careful when you
can't.

~~~
cperciva
One connection can have many outstanding requests.

~~~
btmorex
I would redesign your protocol to be request/response based, akin to HTTP.
Achieve performance by using multiple connections in the client. Simplicity >
efficiency especially if you don't have the engineering resources of a company
like Google.

And I'm out. The reply rate limiting is infuriating.

~~~
hamburglar
It's really easy to glibly criticize someone else's design decisions when 1)
you don't have a full understanding of their problem, architecture, or
rationale for that architecture, and when 2) the medium of the conversation
doesn't lend itself well to providing you a satisfactory explanation.

It seems as though you've gotten the tiniest glimpse of some details about the
system and went on to assume he made a boneheaded decision and you know
better. Do you have some secret evidence that he's incompetent and doesn't
have a good reason for his decision?

------
ac29
Sorry if this is offtopic, but can anybody explain the value proposition of
tarsnap to me? It seems like a nice service and all, but the pricing is an
order of magnitude more expensive than S3. If you are storing a few GB, this
might not matter ("over half of Tarsnap users spend under $1 per month on
storing their backups"), but if you have that little data, why not just dump
it on a free Dropbox/Gdrive/etc account?

For more data, why not just use one of the many compressed, deduplicated,
encrypted, incremental backup systems (attic comes to mind, I'm sure there are
others) then just sync to S3 at a tenth the cost?

~~~
segf4ult
Because tarsnap is cheap, incredibly well documented, open source, and run by
an awesome guy. It's an all around win-win.

~~~
stephenr
Rsync.net is even cheaper, has no requirement for a custom client, and is
arguably more dependable because they're not just reselling S3.

Edit: not to mention they offer actual support, not just a "contact the
author" email link as a last resort.

~~~
GhotiFish
I contacted the author today. He responded to me in 30 seconds.

~~~
stephenr
Try in 18 hours. Can you call him when something fails?

I'm not saying he isn't responsive I'm saying depending on a one-man-band who
is responsible for the client software, server software and the underlying
storage system (ie he is the owner of the s3 account) seems like a _huge_
risk.

------
patio11
In case any other customer is wondering "Wait, I didn't hear anything from my
monitoring about that and I'm retroactively worried. How worried should I be?"
like I was: I just pulled our logs and reconstructed them, and it shows over
the last ~30 days that the worst-case performance of our daily backup (~150 MB
per day delta, ~45 GB total post deduplication) was about 40% longer than our
typical case. This didn't trip our monitoring at the time because they all
completed successfully.

n.b. Our backups run outside of the hotspot times for Tarsnap, so we may have
had less performance impact than many customers. I have an old habit of
"Schedule all cron jobs to start predictably but at a random offset from the
hour to avoid stampeding any previously undiscovered SPOFs." That's one of the
Old Wizened Graybeard habits that I picked up from one of the senior engineers
at my last real job, which I impart onto y'all for the same reason he imparted
it onto me: it costs you nothing and _will_ save you grief some day far in the
future.

~~~
mtsmith85
Hear hear on said Old Wizened Graybeard habit. The amount of pain inflicted
from twenty jobs all starting up at :00 (or even :30, :45, etc.) when they
could easily run at :04 or :17 can be huge. Anecdotally I once "lost" a
sandbox server to a ton of developer sandbox jobs starting at :00 and not
completing before the next batch started.

~~~
protomyth
Funny part to that: I was on a project with multiple teams and multiple
crontabs. Each team took that advice to heart for some jobs. Sadly, we had too
many Hitchhiker fans and :42 became a bit too common.

~~~
kijin
Use the following shell command to decide when to run cron jobs:

    echo $((RANDOM % 60))

It's not a CSPRNG, but good enough for this kind of load balancing!

~~~
cperciva
Or schedule your cron job for :00, but add "sleep `jot -r 1 0 3600` &&" to the
start of the command. (jot is a BSDism, but I assume you can do the same with
GNU seq.)
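
For what it's worth, on GNU systems (where `jot` isn't available), coreutils'
`shuf` does this in one step; a sketch, using the same 0-3600 second window:

```shell
# Pick one random integer in [0, 3600] -- the GNU-coreutils analogue of
# `jot -r 1 0 3600`. In a crontab you would prepend it to the job, e.g.:
#   0 * * * * sleep $(shuf -i 0-3600 -n 1) && /path/to/backup-job
delay=$(shuf -i 0-3600 -n 1)
echo "$delay"
```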

~~~
junkblocker
sleep $[RANDOM/3600] works everywhere without requiring jot/seq etc. on
BSD/Mac/Linux.

~~~
cperciva
s/\//%/ I assume?

~~~
junkblocker
Oops

    s/\//\\%/

yeah.

    sleep $[RANDOM\%3600]

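For anyone puzzled by the backslash: in a crontab, a bare `%` ends the command
and feeds the rest to it as stdin, so `%` must be escaped there (plain shell
needs no escape). A hypothetical full entry, with `SHELL=/bin/bash` set because
`RANDOM` and `$[ ]` are bash-isms rather than POSIX sh (the script path is made
up):

```shell
SHELL=/bin/bash
# m h dom mon dow  command
0 3 * * * sleep $[RANDOM\%3600] && /usr/local/bin/run-backup.sh
```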