
Ask HN: Keeping track of random stack traces - beatthatflight
Every so often in a piece of software we're testing, you get a random crash. If you're unlucky it's a nice race condition: it doesn't happen for 50 runs in a row, then boom, it happens again.

In terms of tracking it for the future, what suggestions do people have? I can backlog it as a bug, but it's not going to be easily searchable. A dev could pick it up, but without a way to reproduce it, it's not easily fixed in a sprint either.

It's also hard to know whether it ever gets fixed!

But it's still a crash, and I personally hate not documenting them, no matter how rare. I'd just like a better way to manage it.
======
wallstprog
Pretty low-tech, but what we do is create an MD5 hash of the whole stack
trace as a single string. Before hashing, we munge a few things so that
similar stack traces hash to the same value:

- remove file/line numbers

- omit the bottom (top) frame, which can differ between environments

- convert certain constructs to a common format (e.g., "unknown module"
(clang) to "???" (valgrind))

- translate "func@@GLIBC_version" => "func"

This works well enough in practice for our purposes (identifying regressions
and suppressing specific reports from valgrind/AddressSanitizer).

We also maintain a cross-reference between each MD5 hash and its full stack trace.
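Roughly, in Python (a sketch of the idea, not our actual script; the exact regexes are illustrative and will need tuning for your own toolchain's output):

```python
import hashlib
import re

def normalize_frame(frame: str) -> str:
    frame = re.sub(r":\d+\b", "", frame)             # drop line numbers
    frame = frame.replace("unknown module", "???")   # clang -> valgrind style
    frame = re.sub(r"@@GLIBC[\w.]*", "", frame)      # func@@GLIBC_2.17 -> func
    return frame.strip()

def stack_hash(frames: list[str]) -> str:
    # omit the bottom frame, which can differ between environments
    normalized = [normalize_frame(f) for f in frames[:-1]]
    return hashlib.md5("\n".join(normalized).encode()).hexdigest()

# Keep a cross-reference from hash -> one representative full trace
# (a dict or a small table) so a hash seen in a report can be expanded later.
```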

------
mceachen
You didn't specify where your software is running. I'm building software that
my users install and run on their own hardware.

I send errors to Sentry.io when the error contains a novel stacktrace for the
user and the user hasn't disabled error reporting. I also send recent log
messages and some other info, like the OS and hardware architecture (so I can
reproduce it on my end). [1]

PhotoStructure uses a SHA of the stacktrace to discriminate between different
errors. This certainly can group different problems together, but in practice
those problems are related.

Only sending novel stacktraces keeps a single user from clogging up my Sentry
dashboard, and from wasting my users' bandwidth. PhotoStructure imports huge
libraries, and before I added this squelching, a single user could send
tens of thousands of reports (when the "error" turned out to be an ignorable
warning caused by their camera writing metadata that was malformed but still
parseable).
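The squelching idea sketches out the same in any language. A rough Python version, where send_report() and the seen-hashes file are placeholders for whatever reporting SDK and local storage you actually use:

```python
import hashlib
import json
from pathlib import Path

SEEN_FILE = Path("~/.myapp/seen-errors.json").expanduser()

def stack_sha(stacktrace: str) -> str:
    return hashlib.sha256(stacktrace.encode()).hexdigest()

def report_if_novel(stacktrace: str, context: dict, send_report) -> bool:
    seen = set(json.loads(SEEN_FILE.read_text())) if SEEN_FILE.exists() else set()
    digest = stack_sha(stacktrace)
    if digest in seen:
        return False                     # already reported; don't spam the dashboard
    send_report(stacktrace=stacktrace, context=context)
    seen.add(digest)
    SEEN_FILE.parent.mkdir(parents=True, exist_ok=True)
    SEEN_FILE.write_text(json.dumps(sorted(seen)))
    return True
```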

If you're building a SaaS, and you own the hardware your software is running
on, just send all errors to Sentry.

Sentry does a good job in helping me triage new errors, marking when errors
crop back up, and highlighting which build seems to have introduced a novel
error.

Keep in mind that the stacktrace may not be relevant if that section of code
or the upstream code is modified. I use automatic upgrading on all platforms
to keep things consistent.

[1] [https://photostructure.com/faq/error-reports/](https://photostructure.com/faq/error-reports/)

------
babygoat
Sentry is great for this.

------
nitwit005
At a previous company there was a home-built service that was a database of
unhandled Java exceptions. It generated a hash value for each one so that you
could see how often exceptions were happening, and graph them over time.

Highly imperfect of course, and it created separate entries for some
exceptions that included random numbers in their message. But it did put
pressure on people to clean them up.
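One way to blunt the random-numbers-in-the-message problem is to scrub volatile tokens before hashing. A hypothetical Python sketch (the patterns depend entirely on your own messages):

```python
import hashlib
import re

def message_key(exc_class: str, message: str) -> str:
    message = re.sub(r"\b0x[0-9a-fA-F]+\b", "<addr>", message)  # hex addresses
    message = re.sub(r"\b\d+\b", "<n>", message)                # ids, counts, ports
    return hashlib.sha1(f"{exc_class}: {message}".encode()).hexdigest()

# e.g. "TimeoutError: request 48213 timed out after 30000 ms" and
#      "TimeoutError: request 991 timed out after 30000 ms"
# both map to the same key.
```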

------
fierarul
A little automation can help you here. Errors could be auto-transformed into
bugs somewhere, and duplicates just add +1s or votes, etc. How you detect
duplicates depends on your setup, but it should be doable (e.g., use the
stacktrace SHA as a tag).
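Something like this, where the `tracker` client and its methods are stand-ins for whatever issue-tracker API you actually use (Jira, GitHub Issues, etc.), not a real library:

```python
def file_or_bump(tracker, stack_hash: str, stacktrace: str) -> None:
    # Look for an existing bug tagged with this stacktrace hash.
    existing = tracker.search(tag=f"stack-{stack_hash}")
    if existing:
        existing[0].add_comment("Seen again.")
        existing[0].increment_field("occurrences")   # or add a +1 / vote
    else:
        tracker.create_issue(
            title=f"Intermittent crash {stack_hash[:10]}",
            body=stacktrace,
            tags=[f"stack-{stack_hash}", "intermittent"],
        )
```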

------
dev_north_east
I feel you.

> I can backlog it as a bug, but it's not going to be easily searchable.

In my experience, I've marked it as a bug, commented with the stack trace, and
marked it as U. Then when it arises again, hopefully someone searches for part
of the stack and gets lucky, or, more often than not, I (or others) will hear of
the crash and relay the bug info. The bug is updated with any new info and life
continues until it crashes again... Not perfect by any means. I'd love to hear
how others deal with this.

~~~
quickthrower2
I've also added more logging code to help with future occurrences, and/or tried
to reason about how it could have happened. Try to make the failure impossible
if you can, but sometimes that requires too big a rearchitecture to be worth it.
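For the "add more logging" part, a small illustrative Python sketch of capturing the context you wish you'd had (all names are made up):

```python
import logging
import threading
import time

log = logging.getLogger("flaky")

def with_crash_context(fn, *args, **kwargs):
    start = time.monotonic()
    try:
        return fn(*args, **kwargs)
    except Exception:
        # Log the details that usually go missing with rare failures:
        # which thread, how long it ran, and what it was called with.
        log.exception(
            "intermittent failure in %s (thread=%s, elapsed=%.3fs, args=%r)",
            fn.__name__, threading.current_thread().name,
            time.monotonic() - start, args,
        )
        raise
```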

------
RabbitmqGuy
I have an upcoming product in this space. It basically lets you send errors
and their stack traces to Datadog. You can then search, aggregate, filter, etc.
your stack traces.

You can email me if this interests you (email is in my bio)

------
drewg123
I'd suggest looking at backtrace.io. It may be overkill for what you want to
do, but one thing it does really well is log stack traces.

