
Logs were our lifeblood, now they're our liability - vinnyglennon
https://vicki.substack.com/p/logs-were-our-lifeblood-now-theyre
======
ping_pong
All this effort to distill user behavior and user intent from logs... why not
just ask them?????

I'm still waiting for the option to see the ads that I want to see. I want to
see movie trailers, something that I rarely see now because I don't watch TV.
I want to see new video games. I want to see new books, what sports events are
going on, etc. Why not ask me? It's literally not rocket science here but
billions are spent on machine learning, clicks, storing exabytes of data or
more trying to figure this shit out.

Just ask me for fuck's sake, I'm more than willing to watch ads in exchange
for a useful service!

~~~
gwbas1c
Honestly, I want a subscription service that allows me to see most major
websites without ads.

I think someday someone's going to realize just how silly the advertisement
game is, and as long as the payment structure is in place, we can get a much
better web experience.

For example, many of us pay a small monthly fee for Netflix. I'm sure that a
small monthly fee could add up to more than what most sites make from ads.

~~~
wutbrodo
> think someday someone's going to realize just how silly the advertisement
> game is, and as long as the payment structure is in place, we can get a much
> better web experience.

You're not the first, or tenth, or millionth person to think of this. Hell,
even just limited to HN, micropayments and general content subscriptions have
been discussed for a decade. Consumers are stuck in an equilibrium where they
don't want to pay for web content (especially text content), and the path to
getting them to the equilibrium of paying without thinking about it (like they
do with Netflix or their power bill) is unclear.

It's not just theoretical: Companies like Google have also been experimenting
with this for yeaaars, to diversify away from the risk (whether regulatory or
technological or otherwise) of relying on ads as a primary revenue source.
There are complications beyond consumer behavior, like bringing the colossally
complicated ad ecosystem under a single payments system (since nobody wants to
pay for a service that only removes some fraction of ads from the web).

~~~
majewsky
> Consumers are stuck in an equilibrium where they don't want to pay for web
> content (especially text content), and the path to getting them to the
> equilibrium of paying without thinking about it (like they do with Netflix or
> their power bill) is unclear.

You disprove yourself by mentioning Netflix. The path is absolutely clear:
Customers are willing to pay for added value that's proportionate to the cost.

The problem for publishers is they do not add any value that would justify
customers paying enough for their content. Few people will pay for a newspaper
subscription when there are 10 other newspapers offering 90% of the same
content for free.

There are models that work, e.g. Patreon, but those usually don't scale up to,
say, the Washington Post or CNN.

~~~
wutbrodo
> You disprove yourself by mentioning Netflix

This isn't how equilibria work. Netflix was a superior product to piracy in
many ways: no perceived legal risk, reliable access, high quality guaranteed,
way better ease of use. These barriers were high enough that plenty of people
didn't pirate at all and stuck with nonsense like DVDs for way too long, so
the incentive path pointed smoothly towards switching to Netflix, a Pareto
improvement for non-pirates and a fairly easy trade-off for pirates.

There's no such path for web content: adblockers are unquestionably legal,
easy to set up, provide a better experience, and even non-users of adblockers
have a trillion non-paywalled sources in an ecosystem where it's tough for
strong brand loyalty to survive en masse. What advantages do you imagine a
paywall option offering to people when their alternative is better in almost
every respect?

> There are models that work, e.g. Patreon, but those usually don't scale up
> to, say, the Washington Post or CNN.

------
EdwardDiego
The author touches on the cargo cult belief that data has inherent value, so
collect all of it, hire a data scientist or ML expert and point them at it,
and watch the value flow!

Seems to come from the same place as the belief that the Cloud magically makes
everything resilient and scalable without any extra effort on your part. Just
put it in the cloud, and then give your CTO a bonus for suggesting the cloud,
and suddenly you don't need to worry about sysops.

Don't get me wrong, "log all the things" is a good place to start when you
need to figure out what's actually worth logging - but it needs to be followed
by a rigorous prune.

Otherwise your data-lake turns into a data-swamp: you collect a lot of noise
that makes it harder to find the signal, and people eventually end up spending
a lot of time trying to figure out what's actually used, if anything, when
Hadoop fills up or the S3 bill gets too high.

------
joshmarinacci
I was really hoping this would be about the lumber industry.

~~~
crummy
Thanks to GDPR, we're no longer allowed to count the rings to reveal PII.

~~~
wpasc
Not to mention the crazy restrictions around acorns

~~~
bicknergseng
They're nuts.

------
rubbingalcohol
This brings up really good points. One of the best practices going forward is
to minimize storage (and logging) of any personally identifiable information.

Under GDPR, IP addresses can be considered PII, so it makes sense to set up an
anonymizer for nginx IP address logs. There is a great Stack Overflow answer
on this: [https://stackoverflow.com/questions/6477239/anonymize-ip-
log...](https://stackoverflow.com/questions/6477239/anonymize-ip-logging-in-
nginx)
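
One common approach (a sketch of the idea only; the helper name is made up and
this isn't necessarily the exact recipe at that link) is to zero out the last
octet before the address is logged:

```
// e.g. 203.0.113.42 -> 203.0.113.0 (IPv4 only; IPv6 would need its own rule)
const anonymizeIp = (ip) => ip.replace(/\.\d+$/, '.0');
```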

But also there's some app hygiene involved. At least one of the recent "data
breach" notifications involved not an actual leak of personal information, but
unsanitized logs containing personal information that should not have been
shared intra-organizationally. I forget the company that did this, but they
notified as if it had been a breach even though passwords had just been logged
internally.

When testing it's convenient to do stuff like

```
console.log('username: ', req.body.username);
console.log('password: ', req.body.password);
```

but it's all too easy to forget about it when you're working on a million
things. So a big part of the solution is mindfulness (do I _really_ need to
log this?)
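
One cheap guardrail (just a sketch; the field names and helper are made up for
illustration, not taken from any particular codebase) is to scrub
known-sensitive fields before anything reaches the logger:

```
const SENSITIVE = new Set(['password', 'token', 'secret']);

function redact(obj) {
  // Replace sensitive values so a stray debug line can't leak them.
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) =>
      [key, SENSITIVE.has(key.toLowerCase()) ? '[REDACTED]' : value])
  );
}

// console.log('body: ', redact(req.body)); // instead of logging req.body raw
```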

~~~
pmoriarty
_" Under GDPR, IP addresses can be considered PII, so it makes sense to set up
an anonymizer for nginx ip address logs."_

One thing I wonder about is what you would do if, say, you have an abuser on
your site that you need to ban due to behavior detected after the fact through
a log file.

If one needs their IP in order to ban them, but their IP is anonymized, what
do you do?

~~~
DiabloD3
One-way hash of the IP.

~~~
jacquesm
The search space is so small you can simply create a table with all of the
hashes and use that to reverse the hash.
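
A quick illustrative sketch of that (the address, prefix, and choice of
SHA-256 are made up; a real attacker would simply precompute all 2^32 hashes):

```
const crypto = require('crypto');

const hashIp = (ip) => crypto.createHash('sha256').update(ip).digest('hex');

// The "anonymized" value that ended up in the logs:
const loggedHash = hashIp('203.0.113.42');

// Reverse it by enumerating candidates. Only a /16 is scanned here to keep the
// demo fast; the full IPv4 space is just 2^32 (~4.3 billion) hashes anyway.
outer:
for (let a = 0; a < 256; a++) {
  for (let b = 0; b < 256; b++) {
    const candidate = `203.0.${a}.${b}`;
    if (hashIp(candidate) === loggedHash) {
      console.log('recovered:', candidate);
      break outer;
    }
  }
}
```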

~~~
GhettoMaestro
You could always salt it.

~~~
majewsky
Salts only help when you already know which hash to check, e.g. because the
user supplied the username that selects which password hash to verify. When
matching an IP against a list of salted hashes, you need to hash the input
with every possible salt to compare against all hashes in the list. So for
performance reasons, it's probably not feasible to use more than one salt (or
a small number of salts). Then it's again very possible to reverse the list of
salted hashes, because the search space is the number of all IP addresses
times the number of salts used, which is vastly smaller than the hash
function's output space: 2^32 addresses times even a few thousand salts is
only on the order of 10^13 hashes, well within reach of a single GPU.

------
dxbydt
The 5-user Nielsen test referred to in this article is quite inaccurate. If
(and that's a very big if) the users are IID, then yes, 5 users is all you
need. But your users from Russia aren't your users from the USA, who aren't
your users from China, etc. Even if your userbase belongs to a single country,
there are differences between a CA user vs a TX user vs a NY user, etc.
Further, the analysis isn't static in time! You as a single user will be a
different person tomorrow, because you are more familiar with the software, or
your mood is better, or your worktable/mind is less cluttered so you can pay
more attention, etc. In other words, the world isn't a set of multinomial
coins with fixed head probabilities. Here the coins are people & the
probabilities change over time. So the only sufficient statistics are order
statistics. Hence logs. Nielsen adds a massive caveat, "The formula only holds
for comparable users who will be using the site in fairly similar ways" - it's
possible to find such people in very homogeneous groups. Like if you have a
GRE SaaS & target only the white college kids taking the GRE, hopefully 5
white kids is enough. Now you bring in black & brown & hispanic & chinese & so
forth...maybe 5 of each. Or maybe you want to separate by sex, so 10 of
each...it gets complicated very soon, which is why it's much simpler to just
log everything/everybody.
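
For reference, Nielsen's formula is problems_found = N(1 - (1 - L)^n), with L
usually pegged at 31% per user; a quick back-of-the-envelope check (my
numbers, only illustrating that claim):

```
const L = 0.31;                               // Nielsen's typical per-user hit rate
const found = (n) => 1 - Math.pow(1 - L, n);  // fraction of the N problems found by n users
console.log(found(5).toFixed(2));             // ~0.84, the famous "5 users find ~85%"
console.log(found(15).toFixed(3));            // ~0.996, diminishing returns after that
```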

~~~
zer00eyz
> The 5 user Nielsen test referred to in this article is quite inaccurate

> "The formula only holds for comparable users who will be using the site in
> fairly similar ways"

This is the article where Nielsen breaks down what is being tested, and why it
is statistically relevant.

[https://www.nngroup.com/articles/why-you-only-need-to-
test-w...](https://www.nngroup.com/articles/why-you-only-need-to-test-
with-5-users/)

Nielsen is looking to solve HCI(1) and Human Factors(2) issues - most of these
are byproducts of having a deep (insider) understanding of a product and that
bubbling up into your UI. You are going to catch a lot of errors that fit the
adage "can't see the forest for the trees". Having sat through a LOT of these
tests, you will pick out user frustrations and reasons for product abandonment
that would likely not be apparent at all in a log.

Your examples of US/China, TX vs CA, and GRE with race and class MIGHT be
relevant, but it is going to depend a whole lot more on what you're building.
The problem is there are other means and places where these issues might
manifest, and again, user testing would tell you a lot.

If we were to build a VR game that used a chopstick-like interface and tested
it only in China, we would likely think that we had a good product. If we find
out later that "this isn't selling in America", then testing in that
demographic group would quickly give us the insight that people lack the
muscle memory to use it intuitively. There isn't any log in the known universe
that would give us that clue, and "test here" is a signal that can (and likely
would) be gleaned by other means.

When you get past HCI and human factors, log data can be useful, and can be a
contra-indicator of the results of formal testing. Giving people a choice
between A and B in a formal setting may give you one set of results even with
a large sample size, but real-world behavior turns out to be very different.
This is akin to people slowing down when they see a police car but driving
fast when one isn't present, or kids acting differently because they know
someone is watching. We aren't discussing UI and UI interactions any more;
we're now discussing human behavior and preference. I can't tell you how many
times I have seen the non-preferred solution be the winning one in an A/B
test; I would generally bet against what the group likes and pick the most
garish solution as the winner.

These behavioral types of tests can only really be driven by logs, by people
being themselves and "feeling" unmonitored, and by accurate demographic (to
your point) slicing and sorting. En masse, people are far more predictable
than they would like to believe. We're delving into something more along the
lines of Asimov's Psychohistory(3), as I don't think these sorts of
statistically predictable behaviors have been given a formal name.

1\.
[https://en.wikipedia.org/wiki/Human–computer_interaction](https://en.wikipedia.org/wiki/Human–computer_interaction)
2\.
[https://en.wikipedia.org/wiki/Human_factors_and_ergonomics](https://en.wikipedia.org/wiki/Human_factors_and_ergonomics)
3\.
[https://en.wikipedia.org/wiki/Psychohistory_(fictional)](https://en.wikipedia.org/wiki/Psychohistory_\(fictional\))

------
dredmorbius
Addressing the value (and limits) of sampling:

Yes, statistical sampling _is_ a hugely useful practice, and is frequently
used, at least by those who are familiar with its power and capabilities.

Depending on _what_ you can see, it may or may not be particularly useful. For
activity logs, you _are_ getting a bunch of relevant information, though if
you stick to just sampling log records, you may miss useful information, such
as paths through a site, session data, and the like.

In doing analysis of the scale and scope of usage and activity of the late and
unlamented Google+, I had the opportunity to sample based on _profile IDs_,
which Google had helpfully stashed in a set of robots.txt sitemap files, back
in 2015. More recently, when seeking information on the number, size, and
activity of G+ Communities (effectively: groups), I could perform a similar
sampling based on the group IDs, also provided via sitemaps.

For a basic assessment of how many active users and groups there were, a small
sample, as few as 100 or so IDs, _selected at random_, were sufficient to
give a general feel. But there's a lot of variance hidden in 2 billion
registered users (as of 2015), or the 8 million Communities existing as of
January 2019. And for detailed measurement of the most active users and
groups, a very small fraction of the total (0.1% of users, and the top 50 or
so of 8 million communities, or 0.000625%), the relevant sampling population
wasn't the total user or group count, but that small subset, _randomly
distributed throughout the whole_, comprising the sample of interest.

To find the very most active users and groups, in other words, you have to
sample a _lot_ of datapoints.
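
Rough arithmetic on that (my numbers, using the counts above): with uniform
random sampling the expected number of hits from a rare subset is just n * p,
so you need a very large n before you can expect to see even one of the top
communities.

```
const topGroups = 50, totalGroups = 8e6;
const p = topGroups / totalGroups;         // ~6.25e-6
console.log(1 / p);                        // ~160,000 draws per expected hit
console.log(1 - Math.pow(1 - p, 50000));   // ~0.27: chance a 50k-draw sample sees even one
```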

(Mind: if I'd had log data, they'd have fallen straight out of that. I didn't.
Which is itself another lesson: in most cases you're interested in _activity_
and not _population_ as a primary analysis variable.)

Given my tools and methods -- requesting URLs and scraping, from a desktop
system over residential broadband -- there were limits to the amount of
sampling I could do. 50,000 profiles were doable in a couple of days, but a
larger pull would have scaled linearly in time. For Communities, I did a
largish pull sized at the minimum level of resolution I thought would be
useful: 12,000 (again, randomly selected) Communities.

In the end I lucked out as a third party was able to provide a comprehensive
dataset of _all_ 8 million communities and summary metadata, from which I
could validate my earlier sample-based methods.

But yes, working with hundreds or thousands of records, rather than millions
or billions, often makes sense, is useful, and requires _vastly_ fewer
resources (compute, time, bandwidth).

For getting a rough idea of just

