
Machine Learning for a Better Developer Experience
https://netflixtechblog.com/machine-learning-for-a-better-developer-experience-1e600c69f36c
======
gridlockd
_" Imagine having to go through 2.5GB of log entries from a failed software
build — 3 million lines — to search for a bug or a regression that happened on
line 1M."_

If your build log files approach gigabytes in size, what you're going to need
to do is _not_ search for that bug. You're going to need to rethink your
career and life choices. You're going to need to take a vacation, do some
traveling, get some perspective. Figure out where it all went wrong and what
to make of the years that remain. Life is short.

~~~
Datenstrom
Or just use grep? Are there people building systems that keep logs that are
not easily grepable? I've never seen one, and I have seen some very poorly
designed systems. I do work outside of traditional environments though
(embedded, robotics, avionics).

~~~
vsareto
>Are there people building systems that keep logs that are not easily
grepable?

Yeah, I've seen ELK stacks, Splunk, or even RDBMS tables for log entries

------
Game_Ender
It would really be interesting to learn more about their domain. Build logs
are something you have control over and should be able to keep clean and
actionable. Service or batch job logs could present a more difficult problem,
though. It would be useful to try to spot a key message halfway through a
failed 20-minute process that is the reason you hit an error at the end.
Again, though, using logging levels and grep should usually be enough. You can
also flag excessively noisy code and improve the signal-to-noise ratio of the
output it produces.
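
The grep-and-log-levels approach can be sketched in a few lines of Python. This is my own toy illustration, not anything from the article; the level names (ERROR/FATAL/SEVERE) are assumptions that vary by logging framework:

```python
import re

# Match common error-level markers; exact names depend on the logging framework.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|SEVERE)\b")

def error_lines(log_text: str) -> list[str]:
    """Return only the lines that look like errors, grep-style."""
    return [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]

log = """\
INFO  starting batch job
DEBUG loading config
ERROR failed to connect to database
INFO  retrying
FATAL giving up after 3 attempts
"""

for line in error_lines(log):
    print(line)
```

This is essentially `grep -E 'ERROR|FATAL|SEVERE'`; it works only as long as the code under investigation uses log levels consistently, which is the commenter's point about keeping logs clean.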

Specifically, with a build system like Bazel you might execute 1,000,000
actions in a build, producing lots of _internal_ output, but you only see the
errors, and at most you have a few hundred lines to look through. That is
managed in a few key ways:

- Test output is completely hidden unless that specific test fails

- Build actions only produce output when they have an error, and it’s easy to
keep it that way (because they stand out, and can be quickly fixed)

- Bazel does not tell you about _every_ action that is run, only the ones that
fail or take a long time
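
The "quiet unless failing" discipline described above can be sketched as a toy build runner. This is my own illustration of the principle, not Bazel internals; the action names and outputs are made up:

```python
# Toy illustration of "only show output for failing actions" (not Bazel internals).
def run_actions(actions):
    """actions: list of (name, fn) where fn returns (ok, output)."""
    shown = []
    for name, fn in actions:
        ok, output = fn()
        if not ok:
            # Only failing actions contribute to what the user sees;
            # output from successful actions is discarded.
            shown.append(f"FAILED {name}:\n{output}")
    return "\n".join(shown) if shown else "Build succeeded (0 lines of output)."

actions = [
    ("compile //lib:core", lambda: (True, "3,000 lines of compiler chatter")),
    ("test //lib:core_test", lambda: (True, "200 lines of test logs")),
    ("compile //app:main", lambda: (False, "error: undefined symbol `frobnicate`")),
]

print(run_actions(actions))
```

A million successful actions produce zero lines here; only the one failure surfaces, which is why a huge Bazel build can leave you a few hundred lines to read rather than millions.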

~~~
dabei
I don't think they are talking about build logs. They are talking about the
logs of a version of their app. One indication is that they gave an example of
identifying login errors.

~~~
thundergolfer
“Log entries from a failed software build” made me think they were talking
about build logs initially.

Was scratching my head at the idea of a build system spitting out 3 million
log lines.

------
stan_kirdey
hey folks, I am one of the authors of the article. Seeing an interesting
conversation here. Wanted to clarify a few points: our logs in general, and
the build logs produced by Jenkins in particular, are all over the place. The
usual suspects like Java build logs can be grepped or tailed, but the
complicated use cases, which actually produce anything from 10-20 megabytes to
gigabytes of console output, leave very little room to investigate - and we do
have lots and lots of these. I am happy to answer any questions here.

------
fwhigh
Friendly reminder: if you can grep, then grep. If you can tail, then tail. If
you can diff, then diff. This is an idea to try for all the cases left over.

------
kkaranth
Not sure I understand the problem they are solving. Aren’t errors in logs
something that can just be grepped for?

(Also: really cool illustrations. Very unexpected in a tech blog.)

~~~
ishcheklein
The way I understand this, they are doing "fuzzy diffing" - I would even say a
semantic diff that can "understand" that "log in error, check log" and
"problem authenticating" are the same (or at least close). They apply it to
compare (GB-scale) files with application or build logs to identify anomalies
efficiently. Otherwise, if you just run `diff file1 file2`, it would produce
an enormous result in a real-life app. At least in their case.

Machine learning is used in the "understands" part above ^^.
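
A minimal sketch of the "fuzzy" half, assuming nothing beyond the standard library: collapse near-duplicate log lines with `difflib` before diffing, so noise from timestamps and IDs drops out. The truly semantic matching ("login error" ≈ "problem authenticating") needs real ML such as text embeddings, which this toy version does not attempt; the log lines and threshold below are made up:

```python
from difflib import SequenceMatcher

def is_near_duplicate(a: str, b: str, threshold: float = 0.8) -> bool:
    """Treat two log lines as 'the same' if they are textually close enough."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def dedupe(lines, threshold: float = 0.8):
    """Keep only lines that are not near-duplicates of an earlier kept line."""
    kept = []
    for line in lines:
        if not any(is_near_duplicate(line, k, threshold) for k in kept):
            kept.append(line)
    return kept

log = [
    "2020-07-01 10:00:01 login error for user 1234, check log",
    "2020-07-01 10:00:02 login error for user 5678, check log",
    "2020-07-01 10:00:03 out of memory in worker 7",
]
print(dedupe(log))
```

The two login errors collapse into one despite different timestamps and user IDs, while the out-of-memory line survives as distinct; diffing the deduplicated streams of two runs is far smaller than a raw `diff` of the full logs. Note the pairwise comparison is quadratic, so GB-scale logs would need something smarter (hashing templates, embeddings, etc.).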

