Logs were our lifeblood, now they're our liability (vicki.substack.com)
188 points by vinnyglennon 13 days ago | 92 comments

All this effort to distill user behavior and user intent from logs... why not just ask them?????

I'm still waiting for the option to see the ads that I want to see. I want to see movie trailers, something that I rarely see now because I don't watch TV. I want to see new video games. I want to see new books, what sports events are going on, etc. Why not ask me? It's literally not rocket science here but billions are spent on machine learning, clicks, storing exabytes of data or more trying to figure this shit out.

Just ask me for fuck's sake, I'm more than willing to watch ads in exchange for a useful service!

Back in the olden days, we'd run ads in programming magazines, because that was our target audience. Lots of people bought those magazines only for the ads for programming products. Heck, I buy hot rod magazines because they have ads from vendors for parts I need.

With the web sites I run, I thought it would be a service to users to display ads appropriate to the content of the page. For example, if the page was about bash programming, you might see an ad for a book on bash.

But there turned out to be NO WAY to specify a category of ads to run on a site. It was all their algorithm. Hence my sites on programming would keep running ads for "Batman Returns". How utterly pointless and boring.

I had a page on the Revolutionary War, and the affiliate ads would be for travel agencies. What would have worked would be books on military history.

I eventually gave it up.

I.e. the ad should be on-topic to the specific web page, not specific to the reader. Not only would this be much better targeted, gathering profile data on the user would not be helpful.

> ad should be on-topic to the specific web page, not specific to the reader

Amen. I find myself returning to something I said 4 years ago about the limit of acceptable ads: https://news.ycombinator.com/item?id=10521930

Target the intent or interest, not the user.

I often find myself researching and looking at sewing supplies and equipment for my partner's business.

A byproduct of these searches and this browsing is that I will often get ads for products for women. Dresses are fairly common.

If I spend time digging around for sheer fabrics or lace I end up getting served ads for women's undergarments. I have gotten quite a bit of side eye in office environments because of this!

The problem is that movies, travel, women's fashion all have the most money to throw at marketing and they get the most play as they result in the biggest payout for the people running the ads. It isn't any different than beer and cigarettes being in EVERY publication that had a male leaning demographic.

> I eventually gave it up.

Sounds like a business opportunity. An ad network where the "user interests" are supplied by the publisher (and maybe user-agent, see below), not by the ad network. The network doesn't need cookies for the user, doesn't track the user at all - the analytics are based exclusively on the data provided by the publisher/ website, not the user.

This way, the publisher might target on the content (plus maybe known user preferences/ past activity on the site, while logged in). Sure, the publisher might also track the anonymous users & supply anonymous profiles, but at that point, why bother? Just use doubleclick like everybody currently does.

You could even imagine integrating with the user agents. If I'm interested in ads for cars, I might willingly supply the information about "what kind of ad I'm interested in". In my own "advertising profile", that I keep under my control. Sounds like having a user-controlled advertising profile is a win for everyone - both users & advertisers. I'm not quite sure why nobody does it.

> Lots of people bought those magazines only for the ads for programming products. Heck, I buy hot rod magazines because they have ads from vendors for parts I need.

Same goes for guitar magazines. Of course interviews are good, but the ads are how most guitar players learned about new pedals and other things we need (or want to need).

And ironically, computer magazine ads used to be the info source when it came to new hardware or software.

The problem is that Internet ads come with the risk of infecting your system with shit nobody wants, 20 trackers completely unrelated to the product being advertised, and the fact that they much too often rely on shady tricks like wobbling or blinking or pretending to be something else just to annoy the fuck out of you enough that you'll click in the hopes of getting rid of it.

That's why ads are such a problem. Not that we don't want ads. What we don't want are the risks and annoyances that the Internet has brought to them.

There were even magazines like Computer Shopper that consisted entirely of ads, and people would pay money to subscribe.

I think your comment is spot on. Have Google et al really missed that, or have they just concluded that individual tracking is better? It seems a pity as focusing on the content being served for ad cues is so obviously how it used to work.

The holy grail of advertising has always been proving that it works.

Delivering one ad that results in a confirmed purchase is much more lucrative than multiple plain page views, or even clicks. The result is twofold:

1- trying to predict who's ready to purchase something, right now.

2- trying to prove someone that's seen an ad actually made a purchase.

This is why Gmail for free made sense from day one. What better way to verify a purchase than by having access to emailed receipts? This is also why Google's been buying credit card purchase histories from Visa/MasterCard/banks etc.

Separately, there's predicting one of the big life events: college, wedding, baby, home purchase. If I'm remembering correctly, getting a jump over competitors on one of these can be worth hundreds of dollars.

There's the story of Target using purchase behavior to try and predict pregnancy, so that they could send targeted / trackable coupons to the expectant mothers. They sent one such packet to a 16 year old girl, whose parents immediately threw a fit about how inappropriate it was, and how dare they accuse their innocent little girl of being promiscuous...

Turns out not only was she pregnant, but Target knew before even the girl realized it.

The kicker: this was circa 1996.

All of this to say, targeted ads might make sense, but proven effectiveness pays more. For that, tracking's pretty much required.

Thanks, that was a good read. Especially this! <Turns out not only was she pregnant, but Target knew before even the girl realized it.>

This sounds like a business opportunity for somebody. I guess the reason nobody is doing this is that there's probably a lot more money to be made from spying on users and gathering data on them. It's a pity because it ruins the experience for all concerned. The sooner people become more privacy aware and legislation like GDPR sees wider uptake, the better for internet commerce.

Amazon affiliate is probably the only way. But you need to hand pick everything..

I did resort to hand picking them for Amazon. Unfortunately, that requires constant attention.

Amazon, please let your affiliates at least pick the category! Then use keywords on the web page to pick the product in the category, hopefully at random. Seeing the exact same ad over and over is counterproductive.

The problem is there's a huge trust gap on both sides.

The advertisers figure that self-reported data may be faulty. In some cases it could be-- either maliciously ("I'm 92 years old and spend $600 per month on my phone bill!") or by omission ("I hadn't even thought about it, but my lease is up in 3 months and high-yielding car ads might well be useful for me")

Consumers won't be easily convinced they're being taken seriously-- if they have to go out of their way to customize ads, and then see equal or more ads than before, or non-laser-focused ads, their trust goes out the window. It's also going to be difficult to stand out in a sea of banners and say "we're the high quality ad you customized... right next to 32 click-your-state-for-mortgage-broker-lead-arbitrage banners."

I suspect there may also be a third distrust-- between ad networks and advertisers. If you ran ads for a product only on a really targeted audience, the figures may not look compelling to buyers. It might look better to say "This $10k ad campaign landed in front of one million weakly targeted eyeballs... and generated 40 sales" than to say "We spent $10k on a super-premium media blitz that hit 200 hand-selected individuals and sold 45 units."

Sales have a long tail. I recently bought a widget I saw advertised 3 years ago but didn't need until now. My parents still buy a sauce because the company ran a great ad 25 years ago (the sauce is good, but they haven't tried the competitors).

If there’s really a trust issue on the offer side (advertisers) then why is YouTube asking survey questions about my demographics and related nonsense?

I am a consumer and I don’t want the cognitive pollution associated with having images and sound shoved in my face. Full stop. Separately, worth mentioning, is the fact that the idea of introducing purposeful limitations deliberately to get people to pay more is also evil (I can play YouTube on my desktop and switch tabs, and on iOS for ex. it stops playback so you can buy YouTube Red).

EDIT: Offer side/bid side is ambiguous since the model is “inverted” with advertising. I’m leaving my comment as is, but it’s probably better to say “bid side” in this specific case.

> Separately, worth mentioning, is the fact that the idea of introducing purposeful limitations deliberately to get people to pay more is also evil (I can play YouTube on my desktop and switch tabs, and on iOS for ex. it stops playback so you can buy YouTube Red).

This is a fascinating claim to me. Selling access to a service is presumably okay, but having a limited-features version available for free use is "evil"?

It isn't the access to the service that is evil.

Rather, that you get a lesser service on a different platform unless you pay. Consumers expect the same experience regardless of platform. When that experience is free on some platforms but paid on another, it frustrates.

ability to use on different platforms = a feature

It's weird to me to call this a feature, a few years ago it was a given. YouTube had to go out of their way to break playback on mobile to offer this "feature".

I consider it a bug they've added purposefully and they request each user pay for.

It demands that engineering resources are set aside and costed in each case, both for implementation and maintenance. It is definitely a feature, whether you'd been getting it for free up to now or not.

Feature or bug, that's ultimately up for the market to decide.

Pretty sure the market has adjudicated on YouTube at this stage LOL

That click-away behaviour is easily addressed by using VLC or mpv to play content. Either with video (and in PiP mode if you like), or audio-only (which I generally prefer).

>I'm still waiting for the option to see the ads that I want to see.

This is not the revenue-maximizing ad. Expected value for the company buying the ad is proportional to the product of conversion value and likelihood to convert. "Ads that I want to see" corresponds to the latter, but is completely uncorrelated to the former.

For example, advertisers would rather sell a 1% chance of generating a valuable asbestos lawsuit lead than a 100% chance of generating a $2 lunch restaurant lead.
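
To put rough numbers on that, here is a minimal sketch of the expected-value comparison. The 1%, 100% and $2 figures are from the comment above; the $1,000 value for a lawsuit lead is a made-up number purely for illustration.

```javascript
// Expected value of showing an ad = P(conversion) * value of one conversion.
// NOTE: LAWSUIT_LEAD_VALUE is a hypothetical figure, not real market data.
const LAWSUIT_LEAD_VALUE = 1000;
const LUNCH_LEAD_VALUE = 2;

function expectedValue(conversionProbability, conversionValue) {
  return conversionProbability * conversionValue;
}

const lawsuitAd = expectedValue(0.01, LAWSUIT_LEAD_VALUE); // 1% chance
const lunchAd = expectedValue(1.0, LUNCH_LEAD_VALUE); // 100% chance

// The ad the user is far less likely to want still wins on expected value.
console.log(lawsuitAd > lunchAd); // true
```

Under that assumption the long-shot ad is worth several times the sure thing, which is the whole point: likelihood to convert is only one factor in the product.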

I don't know about you, but I'm overwhelmed by requests to fill out surveys. Every time I buy anything they want me to fill out a survey or leave a review. I ignore them like spam calls.

Figuring out that things went basically okay from logs is much less intrusive. Surveys should be saved for more important things.

fwiw, I work for one of these services.

For a large consumer app (and actually, that's less and less reserved to the larger apps only), you can expect that every time you see a widget, or tap anywhere in the app, it is going to trigger some kind of analytics log.

The goal is not to determine your profile in order to sell you ads but to understand how people use our product (in aggregate).

Let's say you have a checkout feature. You need to add an address, payment method and tap checkout.

If a significant % of your users bail out of the checkout flow in the address step, there might be something you want to investigate.

We also do user studies, they are super useful but are harder to generalize.
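
A rough sketch of the kind of aggregate funnel analysis described above. The event shape (`userId`, `step`) and the step names are assumptions for illustration, not any particular analytics product's schema.

```javascript
// Hypothetical event log: one entry each time a user completes a step.
// "start" means the user entered the checkout flow.
const STEPS = ["start", "address", "payment", "checkout"];

function funnel(events) {
  // Count distinct users seen at each step.
  const usersByStep = new Map(STEPS.map((step) => [step, new Set()]));
  for (const { userId, step } of events) {
    if (usersByStep.has(step)) usersByStep.get(step).add(userId);
  }
  // For each step, report what fraction of the previous step's users were lost.
  return STEPS.map((step, i) => {
    const users = usersByStep.get(step).size;
    const previous = i === 0 ? users : usersByStep.get(STEPS[i - 1]).size;
    const dropOff = previous === 0 ? 0 : 1 - users / previous;
    return { step, users, dropOff };
  });
}

const report = funnel([
  { userId: "u1", step: "start" },
  { userId: "u1", step: "address" },
  { userId: "u1", step: "payment" },
  { userId: "u1", step: "checkout" },
  { userId: "u2", step: "start" },
  { userId: "u2", step: "address" },
  { userId: "u3", step: "start" },
  { userId: "u4", step: "start" },
]);
console.log(report);
```

Nothing here needs to know who `u3` is; the only output is per-step counts, so the user IDs could just as well be random per-session tokens.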

If they asked me, I would tell them that there is no such thing as a relevant ad. Even if I'm shopping for a specific product, if an ad gets through my defenses I stop shopping. If I see an ad for your brand, I will avoid your brand.

Pretty much. Ads tell me that something exists. I don't entirely reject products whose ads I happen to see, but I do rate them inversely proportional to the scope of their claims.

And it's not just obvious ads. You can't trust online customer reviews, because there's an industry devoted to spamming them, both positively and negatively.

And you can't trust third-party reviews, because they're often disguised press releases. Or biased by payment. For example, I've been told that many sites that review VPN services basically auction their rating slots.

Negative reviews are also used to put down competitors' products, blackmail free products or services, and in so many other ways that only humans can invent. The reviews are less reliable every passing year.

Some platforms have awakened to this, so they have rules or guides on what to do when a customer tries to blackmail free stuff with negative reviews. However, not all.

The most useful ads are for something I didn't even know existed before seeing it. I've often bought things to improve my life that before seeing the ad I didn't know existed.

You'll have a hard time shopping for groceries then. Besides displays with ads on them, brands pay big money for preferential placement on shopping aisles. Same for the order of product listings on Amazon etc.

I have a shopping list and a blindfold.

Honestly, I want a subscription service that allows me to see most major websites without ads.

I think someday someone's going to realize just how silly the advertisement game is, and as long as the payment structure is in place, we can get a much better web experience.

For example, many of us pay a small monthly fee for Netflix. I'm sure that a small monthly fee could add up to more than what most sites make from ads.

> I think someday someone's going to realize just how silly the advertisement game is, and as long as the payment structure is in place, we can get a much better web experience.

You're not the first, or tenth, or millionth person to think of this. Hell, even just limited to HN, micropayments and general content subscriptions have been discussed for a decade. Consumers are, in a way, at an equilibrium where they don't want to pay for web content (esp. text web content), and the path to getting them to the equilibrium of paying without thinking about it (like with Netflix or power) is unclear.

It's not just theoretical: Companies like Google have also been experimenting with this for yeaaars, to diversify away from the risk (whether regulatory or technological or otherwise) of relying on ads as a primary revenue source. There are complications beyond consumer behavior, like bringing the colossally complicated ad ecosystem under a single payments system (since nobody wants to pay for a service that only removes some fraction of ads from the web).

Really I think micropayments are a fool's errand because it is inherently too "fraudable" not in the legal sense but that both sides can easily be left unhappy with no recourse in a many to many relationship. No gates and it is just another donate button. Add a gate and clickbait and bait and switch becomes ten times worse. A reputation system would be dystopian and doomed to failure.

Personal subscriptions for websites might work for those who are big enough but "big enough" is far larger than entities which probably should have been trustbusted several decades ago and it is still too fragmented for that purpose. Not helping is that they are always too greedy in price expected from ad viewer vs subscriptions - because of how few bother.

> Consumers are, in a way, at an equilibrium where they don't want to pay for web content (esp. text web content), and the path to getting them to the equilibrium of paying without thinking about it (like with Netflix or power) is unclear.

You disprove yourself by mentioning Netflix. The path is absolutely clear: Customers are willing to pay for added value that's proportionate to the cost.

The problem for publishers is they do not add any value that would justify customers paying enough for their content. Few people will pay for a newspaper subscription when there are 10 other newspapers offering 90% of the same content for free.

There are models that work, e.g. Patreon, but those usually don't scale up to, say, the Washington Post or CNN.

> You disprove yourself by mentioning Netflix

This isn't how equilibria work. Netflix was a superior product to piracy in many ways: no perceived legal risk, reliable access, high quality guaranteed, way better ease of use. These barriers were high enough that plenty of people didn't pirate at all and stuck with nonsense like DVDs for way too long, so the incentive path pointed smoothly towards switching to Netflix, a Pareto improvement for non-pirates and a fairly easy trade-off for pirates.

There's no such path for web content: adblockers are unquestionably legal, easy to set up, provide a better experience, and even non-users of adblockers have a trillion non-paywalled sources in an ecosystem where it's tough for strong brand loyalty to survive en masse. What advantages do you imagine a paywall option offering to people when their alternative is better in almost every respect?

> There are models that work, e.g. Patreon, but those usually don't scale up to, say, the Washington Post or CNN.

I think things like Apple Pay in Safari (not that I use either of them) might enable some movement here - it'll be easier to get people used to paying for content in 'micropayments', i.e. not 'subscribe for £2pcm' but '10p for the rest of the article'

(One) problem with micropayments is that you're then in the situation of constantly being in a state of "Is the rest of this article worth 10p to me?" Years ago, Clay Shirky described this as a problem of mental transaction costs and there's a lot to be said for it.

What I think is more likely--but still pretty speculative--is that an aggregator (like Apple News) could create a sufficiently large stable of publications to offer as a subscription competitive with the handful of pubs like the New York Times and Wall Street Journal that are strong enough brands to go their own way. One thing that's very unclear is whether the mass market is willing to pay the cost associated with that subscription. Probably not.

Today's evidence suggests that people are generally more open to subscriptions than pay-as-you-go for content. Music in particular has pretty much transitioned to subscription for the most part.

I think this risk is oversold as a steady-state problem: nobody feels this way about every tiny marginal bit of electricity use, because we've habituated to a longer cycle of feedback (ie power use was high this month, try to be more mindful on average).

That being said, I think this is one of the biggest hurdles in the transition from where we are to a payments-based system.

I could easily see some universal way of paying to not see ads being both more profitable for websites and less expensive for me. Because what I'm currently paying for mobile data and bandwidth at home in order to compensate for all the waste introduced by telemetry and ads probably amounts to a good bit more than what websites are making by serving me ads.

But at this point the trust is so broken that I probably wouldn't pay for it even if it did exist. Because I'd expect whatever beacon is being used to say, "Paid up, don't serve ads" to just be used as another way to de-anonymize me by the less scrupulous advertisers. Which, as far as I can tell at this point, is pretty much all advertisers.

This has been tried several times. Google Contributor and Blendle are some examples. We had our own. Doesn't work because it's the paying-for-content part that people have a problem with. There's just not enough who are willing to pay the real costs of content. Many even get upset thinking that their internet connection is already "the internet".

Yes! Exactly!

As I understand it, sites don't earn very much from each ad view. So to get substantial income, they need lots of traffic and ad views.

And this gets implemented through a hugely complex process of real-time data sharing and bidding. It's ~transparent if you have a fast uplink. But if you're using Tor browser, you can watch it play out, all too slowly.

So just replace that complex process with a simpler process of identifying the subscription service that the user employs, and adding the page view to their tab.

In order to match current income from ads, the cost per page view would likely be very small. Perhaps $0.01-$0.10. And there could be a mechanism for adjusting that cost based on the local cost of living of a given user. Perhaps through a parameter pushed by the subscription service, which it would obtain in some more-or-less anonymous fashion.

I'd like it too. But here's a kicker. You can pay a small monthly fee for a service to offset the revenue they lost by not showing ads. But the service may opt to show you the ads anyway, and double their revenue. Especially since willingness and ability to pay already signal you're a better than average target for advertising.

Unless there's regulation against advertising-based business models, nothing will change, because competitive pressure will always push towards free+ads, paid+ads, and/or free/paid+ads+data collection.

Yeah, that seems to mostly be a coordination problem.

As in, I'd pay $X/month for something that does that for most sites I frequent. But I'll never sign up for 100 different sites individually.

Same thing for newspapers.

A majority of people don't care enough about ads for a service like this to work and those that do use adblockers.

You're describing Brave.

>why not just ask them?????

Mostly because revealed preferences. https://en.wikipedia.org/wiki/Revealed_preference

Revealed preferences are bullshit if you manipulate the environment so that people reveal the "preferences" you want them to reveal.

They're always bullshit. Even if you don't try to manipulate people all you measure is their current habits.

I've gotta say, I agree.

I don't ever need to see car ads, because I live in NYC. Nor for medications for conditions <x, y, z> which I don't have. Nor for that Amazon product which I already bought.

I am so happy to spend three minutes filling out my interests profile. In fact, I already did it for Google when someone pointed out you can. But for all the many ads served by other ad networks, no dice. :(

You can also run ublock and never see ads.

> All this effort to distill user behavior and user intent from logs... why not just ask them?????

Because users lie, intentionally and unintentionally?

>> I'm still waiting for the option to see the ads that I want to see.

It's not about what you like, it's about what you need or think you need.

It's probably because your interests change over time. And because they want to profit on what you need today. You might like cars, but you don't buy cars every day, just as I don't purchase guitars every week.

I think they want to profit from your immediate need, and that requires spying on you, your messages, your browsing history, etc.

> It's probably because your interests change over time.

They do, and the sites I visit change with them too. Visiting a site about something is as clear and unambiguous a signal about my interests as you could possibly get.

The reason it's totally disregarded is, I believe, because it's simpler and more profitable to run ad networks as a market. Neither the publisher nor the ad network really cares what ads are being run, they care about maximizing profits. And some advertisers (e.g. universally appealing product categories like clothes and movies) can easily outspend niche sellers.

As it is today, advertising is an act of malice, so don't expect anyone to care about the user end of the equation. You can get a better ROI by showing more user-aligned ads, but you can also get a better ROI by doubling down on the surveillance capitalism, and the latter requires less coordination.

You don't ask users about their intent, you monitor how they behave. That's how Google makes their money.

Users are notoriously bad at being able to explain what they want, or even knowing what they want.

This is why UX/usability studies are the gold standard of delivering value: put a user in front of your tool, give them a task and watch the actual actions they make to complete the task. This has been the norm for decades.

The author touches on the cargo cult belief that data has inherent value, so collect all of it, hire a data scientist or ML expert and point them at it, and watch the value flow!

Seems to come from the same place as the belief that the Cloud magically makes everything resilient and scalable without any extra effort on your part. Just put it in the cloud, and then give your CTO a bonus for suggesting the cloud, and suddenly you don't need to worry about sysops.

Don't get me wrong, "log all the things" is a good place to start when you need to figure out what's actually worth logging - but it needs to be followed by a rigorous prune.

Otherwise your data-lake turns into a data-swamp, you collate a lot of noise that makes it harder to find signals, and people eventually end up spending a lot of time trying to figure out what's actually used, if anything, when Hadoop gets full or the S3 bill gets too high.

I was really hoping this would be about the lumber industry.

There's a joke about functional tables of logs in here somewhere, one that unifies forestry, mathematics, programming, the Bible, and DevOps.

(And perhaps explains why my spell corrector camelcased "devops")

> ‘log’ originally denoted a thin quadrant of wood loaded to float upright in the water, whence ‘ship's journal’ in which information derived from this device was recorded.

How do you know if a robot has been stealing your wood?

Check its log files.

Hmm, reminds me that a loooong time ago I used a Linux distro that had a log viewer, and the icon of the log viewer was a pile of lumber logs.

Thanks to GDPR, we're no longer allowed to count the rings to reveal PII.

Not to mention the crazy restrictions around acorns

They're nuts.

This brings up really good points. One of the best practices going forward is to minimize storage (and logging) of any personally identifiable information.

Under GDPR, IP addresses can be considered PII, so it makes sense to set up an anonymizer for nginx ip address logs. There is a great Stack Overflow answer on this: https://stackoverflow.com/questions/6477239/anonymize-ip-log...
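
For a sense of what such an anonymizer does, here is a minimal sketch in application code (this is not the nginx config from the linked answer). It assumes that zeroing the last IPv4 octet is good enough for your purposes, which is a judgment call rather than a legal guarantee; `anonymizeIpv4` is a hypothetical helper name.

```javascript
// Zero the last octet of an IPv4 address before it reaches a log line.
// Anything that is not a well-formed IPv4 address is replaced entirely,
// so unexpected input (IPv6, garbage) never gets logged verbatim.
function anonymizeIpv4(ip) {
  const octets = ip.split(".");
  const valid =
    octets.length === 4 &&
    octets.every((octet) => /^\d{1,3}$/.test(octet) && Number(octet) <= 255);
  if (!valid) return "0.0.0.0";
  octets[3] = "0";
  return octets.join(".");
}

console.log(anonymizeIpv4("203.0.113.57")); // 203.0.113.0
console.log(anonymizeIpv4("::1")); // 0.0.0.0
```

The trade-off is that up to 256 addresses now collapse into one log value, which is the point for privacy and a nuisance for per-IP debugging.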

But also there's some app hygiene involved. At least one of the recent "data breach" notifications involved not an actual leak of personal information, but unsanitized logs containing personal information that should not have been shared intra-organizationally. I forget the company that did this, but they notified as if it had been a breach even though passwords had just been logged internally.

When testing it's convenient to do stuff like

```
console.log('username: ', req.body.username);
console.log('password: ', req.body.password);
```

but it's all too easy to forget about it when you're working on a million things. So a big part of the solution is mindfulness (do I _really_ need to log this?)

At least one? Almost every one of the recent "plaintext password" headlines, of which there have been half a dozen or more, was due to that. And I'm sure there are many more cases happening right now without being noticed. It definitely is a big issue.

And no, I don't think in most cases it's forgetting to remove the `console.log(req.body.password)`, but rather having a much wider `console.log(req)` which you didn't realize contains (or could in some other code path contain) a password. Or some log statement much deeper, 2-3 layers of abstractions away, logging some struct passing through the system, which happens to contain PII.

It definitely isn't a trivial issue with a simple solution, as some people commenting on such headlines seem to imply.
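
One partial mitigation for the "much wider `console.log(req)`" case is to pass objects through a redaction step before they are logged. A minimal sketch follows; the key list and the `redact` helper are assumptions for illustration, and a key-based filter like this will not catch PII hiding under an innocuous field name.

```javascript
// Keys treated as sensitive. This list is an assumption; extend it per app.
const SENSITIVE_KEYS = new Set(["password", "token", "authorization"]);

// Return a deep copy of `value` with sensitive fields masked.
// `seen` guards against cycles (request objects often reference themselves);
// as a simplification it also masks repeated references to the same object.
function redact(value, seen = new WeakSet()) {
  if (value === null || typeof value !== "object") return value;
  if (seen.has(value)) return "[Circular]";
  seen.add(value);
  if (Array.isArray(value)) return value.map((item) => redact(item, seen));
  const out = {};
  for (const [key, inner] of Object.entries(value)) {
    out[key] = SENSITIVE_KEYS.has(key.toLowerCase())
      ? "[REDACTED]"
      : redact(inner, seen);
  }
  return out;
}

const req = { body: { username: "alice", password: "hunter2" } };
console.log(JSON.stringify(redact(req)));
// {"body":{"username":"alice","password":"[REDACTED]"}}
```

This only helps if every log call actually goes through it, which is exactly the "2-3 layers of abstraction away" problem described above.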

It's not that hard when you taint structs with PII, so that the log library faults if it tries to log it.
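
A minimal sketch of what that tainting could look like in JavaScript. `Tainted` and `logEvent` are hypothetical names, and the logger is assumed to serialize with `JSON.stringify`; a logger that serializes some other way would need its own hook.

```javascript
// Wrapper marking a value as PII. Implicit serialization throws, so a
// tainted value that drifts into a log call fails loudly instead of leaking.
class Tainted {
  #value;
  constructor(value) {
    this.#value = value;
  }
  // Explicit, greppable escape hatch for code that legitimately needs the value.
  unwrap() {
    return this.#value;
  }
  toString() {
    throw new Error("refusing to stringify a tainted value");
  }
  toJSON() {
    throw new Error("refusing to serialize a tainted value");
  }
}

// Hypothetical structured logger: returns the line it would write.
function logEvent(event) {
  return JSON.stringify(event);
}

const user = { id: 42, email: new Tainted("alice@example.com") };
console.log(logEvent({ id: user.id })); // {"id":42}
try {
  logEvent(user); // faults: the struct carries a tainted field
} catch (err) {
  console.log(err.message);
}
```

The hard part is making sure everything sensitive gets wrapped at the boundary in the first place.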

Ah yes, just handle tainting/laundering all data that flows into a system which is hooked up to thousands or hundreds of thousands of external data sources. Not that hard at all.


So your alternative is... do nothing?

Of course if your whole system is written with this in mind, it's doable. But many code bases are huge and fixing them isn't trivial.

This: Under GDPR, IP addresses can be considered PII, so it makes sense to set up an anonymizer

is wrong. First of all, PII is a legal definition from mostly the US. The GDPR talks about Personal Data, which is different. If any consultant is talking about GDPR and PII, he is confused, stop listening. Have fun reading the details here: https://gdpr-info.eu/art-4-gdpr/

Which doesn't matter all that much, because the next part of the line is way too broad:

Is an IP address Personal Data? Maybe, if you can use it to track an individual. But if you aren't an ISP and don't actively try to identify anybody with an IP, stop worrying.

Next question: Do you need it? In general, using them for keeping the website running is normal usage. So no need for consent. Using them for attack prevention might actually be an industry best practice, and then the GDPR requires you to keep them.

Next question: If you need them, how to keep them safe? Throw logs away after a while, encrypt backups, etc...

The GDPR has been hijacked by consulting companies to extract money from everybody, so they do their utmost best to sow paranoia with all kinds of weird urban myths. Don't believe it. Basically, do the normal IT best practices and stop worrying.

"Under GDPR, IP addresses can be considered PII, so it makes sense to set up an anonymizer for nginx ip address logs."

One thing I wonder about is what you would do if, say, you have an abuser on your site that you need to ban due to behavior detected after the fact through a log file.

If one needs their IP in order to ban them, but their IP is anonymized, what do you do?

There's nothing wrong with retaining full IP address information for the defined purpose of operating your site. Combining such information with other data to perform, say, ad targeting would be more problematic.

IP alone is not considered personal data. If this is the only thing you store about a user on your website (for example for a static website), don't bother and keep it as it is.

If you store anything else about the user (their firstname/lastname) and can make a relation between this and the IP (e.g. you can see that this IP went to the page myprofile.php?id=438098 at 23:10 yesterday), then you should already have somewhere where the user can see why you store their firstname/lastname. Just add "IP" to the list of data stored, for the purpose of keeping your systems safe and accessible, warn that you'll store it for 30 days, because after that the logs are purged, and then you're fine.
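
The purge itself can be very small. A sketch, assuming log entries are objects with an ISO-8601 `timestamp` field (in practice this is more often a logrotate or database retention setting than application code):

```javascript
// Keep only entries newer than the retention window.
// Assumption: 30 days is whatever period your privacy notice promises.
const RETENTION_DAYS = 30;
const MS_PER_DAY = 24 * 60 * 60 * 1000;

function purgeExpired(entries, now = new Date()) {
  const cutoff = now.getTime() - RETENTION_DAYS * MS_PER_DAY;
  return entries.filter((entry) => Date.parse(entry.timestamp) >= cutoff);
}

const kept = purgeExpired(
  [
    { timestamp: "2019-05-01T12:00:00Z", ip: "203.0.113.57" }, // older than 30 days
    { timestamp: "2019-06-15T12:00:00Z", ip: "198.51.100.7" }, // within the window
  ],
  new Date("2019-06-30T00:00:00Z")
);
console.log(kept.length); // 1
```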

If you are a B2C site then banning based on IP in today's multi-NAT-layered internet with non-fixed IPs and mobile devices is probably going to be wildly ineffective and cause far more collateral damage than you think.

Our current policy is to retain basic access logs, including IP addresses, potentially indefinitely.

In GDPR terms, we have obvious legitimate interests in being able to identify repeat offenders trying to abuse our system in some way, in being able to identify recurring problems with how our systems are operating, and in tracking long term usage patterns. These interests combined with the very low risk of any adverse consequences should any of the relevant logs leak make our policy of indefinite retention in this specific instance compatible with GDPR in our view.

Incidentally, we have in fact identified repeat abusers returning to our site several years later based on their access patterns and IP addresses as recorded in logs, so there is even an objectively demonstrable long-term threat should anyone ever want to question this policy.

GDPR is supposed to allow retention of certain data for security purposes.

There is also the fact that GDPR cannot overrule industry specific regulations that mandate long-term retention.

I work in the gambling industry, and we are required by the regulations to keep ALL user information, including the KYC documents they submit, on file for a minimum of 5 years after their last activity. If you are looking for toxic data stores, this is among the worst ones there is. Limiting access to that data is crucial, and making sure it's not misused is mandatory.

It could be worse. There are domains with more demanding data retention requirements: insurance and consumer finance in particular.

I think you block the IP without correlating it to a given user. Pseudonymity seems like one of the most realistic ways of becoming GDPR compliant without losing the value of data.

One way hash of the IP.

The search space is so small you can simply create a table with all of the hashes and use that to reverse the hash.
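To make that concrete, here is a minimal sketch, assuming the "anonymized" log stores an unsalted SHA-256 of the dotted-quad string (that scheme, and the function names, are my assumptions for illustration). It only enumerates a /16 so it runs in well under a second; the full IPv4 space is the same loop over 2**32 addresses.

```python
import hashlib
import ipaddress


def hash_ip(ip: str) -> str:
    """Unsalted SHA-256 of the address string (the assumed logging scheme)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()


def build_lookup(network: str) -> dict:
    """Precompute hash -> IP for every address in the given network."""
    return {hash_ip(str(addr)): str(addr)
            for addr in ipaddress.ip_network(network)}


# A /16 holds 65,536 addresses, enough to show the idea quickly.
lookup = build_lookup("10.1.0.0/16")

logged_hash = hash_ip("10.1.42.7")  # what the "anonymized" log would contain
recovered = lookup[logged_hash]     # a plain dictionary lookup reverses it
print(recovered)
```

Once the table exists, every hashed address in the log is one lookup away from the original.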

Mandatory IPv6 could finally be here! ;D

You could always salt it.

Salts are only useful when you already know which hash to check e.g. because the user supplied the username that picks which password hash to check. When matching an IP against a list of salted hashes, you need to hash the input with every possible salt to compare against all hashes in the list. So for performance reasons, it's probably not feasible to use more than one salt (or a small number of salts). Then it's again very possible to reverse the list of salted hashes because the search space is number of all IP addresses times number of salts used, which is way less than the number of all hashes.
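The arithmetic behind that is short. A rough sketch, where the hash rate is a made-up round number (an assumption, not a benchmark) and the salt counts are arbitrary examples:

```python
IPV4_ADDRESSES = 2 ** 32    # every possible IPv4 address
SHA256_OUTPUTS = 2 ** 256   # every possible SHA-256 digest

# Hypothetical attacker throughput, chosen only to make the scale visible.
ASSUMED_HASHES_PER_SECOND = 1e9


def search_space(num_salts: int) -> int:
    """Candidates to try: every IP combined with every salt in use."""
    return IPV4_ADDRESSES * num_salts


def seconds_to_enumerate(num_salts: int,
                         hashes_per_second: float = ASSUMED_HASHES_PER_SECOND) -> float:
    """Time to hash the whole search space at the assumed rate."""
    return search_space(num_salts) / hashes_per_second


for salts in (1, 16, 1024):
    print(f"{salts:>5} salt(s): {search_space(salts):,} candidates, "
          f"~{seconds_to_enumerate(salts):,.0f} s at the assumed rate")
```

With one salt the search space is still just 2**32 candidates, which is nothing compared with the 2**256 possible digests. This assumes the attacker knows the salts, which is the case whenever the hashes and the code that produced them leak together.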

How would that address the problem? Salts don't make hashes any less reversible.

Yeah, we're GDPR bound and henceforth compliant, and nope, a one-way hash isn't sufficient. We just drop the last octet and accept it. I mean, we can still say 'this impression was from a user in Strasbourg', but we enrich the record with that en route through our pipeline, about the same time as we drop the last octet of the IP.
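A minimal sketch of that order of operations, assuming IPv4 only; `lookup_city` stands in for whatever geo-IP service a pipeline uses and is purely hypothetical, as are the function and field names:

```python
import ipaddress


def truncate_ipv4(ip: str) -> str:
    """Zero the last octet by keeping only the /24 network address."""
    network = ipaddress.ip_network(f"{ip}/24", strict=False)
    return str(network.network_address)


def anonymize_record(record: dict, lookup_city) -> dict:
    """Enrich with coarse location while the full IP is available, then truncate.

    `lookup_city` is a hypothetical callable taking an IP string and
    returning a city name.
    """
    enriched = dict(record)                        # leave the input untouched
    enriched["city"] = lookup_city(record["ip"])   # needs the full address
    enriched["ip"] = truncate_ipv4(record["ip"])   # only the /24 is stored
    return enriched
```

The enrichment has to come first: once the last octet is gone, the lookup can only be as precise as the /24.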

The 5 user Nielsen test referred to in this article is quite inaccurate. If (and that's a very big if) the users are IID, then yes, 5 users is all you need. But your users from Russia aren't the users from USA aren't the users from China etc. Even if your userbase belongs to a single country, there are differences between CA user vs TX user vs NY user etc. Further, the analysis isn't static in time! So you as a single user will be a different person tomorrow because you are more familiar with the software, or your mood is better, or your worktable/mind is less cluttered so you can pay more attention etc. In other words, the world isn't multinomial coins with fixed head probabilities. Here the coins are people & the probabilities change over time. So the only sufficient statistics are order stats. Hence logs. Nielsen adds a massive caveat "The formula only holds for comparable users who will be using the site in fairly similar ways" - it's possible to find such people in very homogeneous groups. Like if you have a GRE SaaS & target only the white college kids taking the GRE, hopefully 5 white kids is enough. Now you bring in black & brown & hispanic & chinese & so forth...maybe 5 of each. Or maybe you want to separate by sexes, so 10 of each...it gets complicated very soon, which is why it's much simpler to just log everything/everybody.
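For reference, the formula in question is the Nielsen/Landauer curve: the share of usability problems found by n testers is 1 - (1 - L)^n, where L is the chance that a single tester hits a given problem. A quick sketch, using L = 0.31 (the figure Nielsen reports as typical); the lower rate below is a hypothetical value of mine to show how sensitive the result is to that assumption:

```python
def share_of_problems_found(n_users: int, per_user_rate: float = 0.31) -> float:
    """Nielsen/Landauer curve: 1 - (1 - L)^n for n independent, comparable users."""
    return 1 - (1 - per_user_rate) ** n_users


# Typical rate reported by Nielsen.
for n in (1, 3, 5, 10, 15):
    print(f"L=0.31, n={n:>2}: {share_of_problems_found(n):.0%}")

# Hypothetical harder case: each user only hits a given problem 10% of the time.
print(f"L=0.10, n= 5: {share_of_problems_found(5, 0.10):.0%}")
```

With L = 0.31, five users find roughly 84% of problems; with L = 0.10 the same five users find about 41%. The formula also assumes every user has the same L, which is exactly the "comparable users" caveat.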

> The 5 user Nielsen test referred to in this article is quite inaccurate

> "The formula only holds for comparable users who will be using the site in fairly similar ways"

This is the article where Nielsen breaks down what is being tested, and why it is statistically relevant.


Nielsen is looking to solve HCI(1) and Human Factors(2) issues - and most of these are byproducts of having to have a deep (insider) understanding of a product and that bubbling up into your UI. You are going to catch a lot of errors that fit the adage "can't see the forest for the trees". Having sat through a LOT of these tests, you will pick out user frustrations, and reasons for product abandonment that would likely be NON apparent in a log.

Your examples of US/China, TX vs CA and GRE with race and class MIGHT be relevant but it is going to depend a whole lot more on what you're building. The problem is there are other means and places where these issues might manifest, and again use testing would tell you a lot.

If we were to build a VR game that used a chopstick-like interface, and test it only in China, we would likely think that we had a good product. If we find out later that "this isn't selling in America" then testing in that demographic group would quickly give us the insight that people lack the muscle memory to use this intuitively. There isn't any log in the known universe that would give us that clue, and "test here" can (and likely would) be gleaned by other means.

When you get past HCI and Human Factors, log data can be useful, and be a contra-indicator of the results of formal testing. Given a choice between A and B in a formal setting may give you one set of results even with a large sample size, but real world behavior turns out to be very different. This is akin to people slowing down when they see a police car - but driving fast when one isn't present or kids acting differently because they know someone is watching. We aren't discussing UI and UI interactions, we're now discussing human behavior, and preference. I can't tell you how many times I have seen the non-preferred solution be the winning one in an A/B test, but I would generally bet against what the group likes and pick the most garish solution as the winner.

These behavioral types of tests can only really be driven by logs, by people being themselves and "feeling" unmonitored, and accurate demographic (to your point) slicing and sorting. En masse people are far more predictable than they would like to believe. We're delving into something more along the lines of Asimov's Psychohistory(3) as I don't think these sorts of statistically predictable behaviors have been given a formal name.

1. https://en.wikipedia.org/wiki/Human–computer_interaction
2. https://en.wikipedia.org/wiki/Human_factors_and_ergonomics
3. https://en.wikipedia.org/wiki/Psychohistory_(fictional)

Addressing the value (and limits) of sampling:

Yes, statistical sampling is a hugely useful practice, and is frequently used, at least by those who are familiar with its power and capabilities.

Depending on what you can see, it may or may not be particularly useful. For activity logs, you are getting a bunch of relevant information, though if you stick to just sampling log records, you may miss useful information, such as paths through a site, session data, and the like.

In doing analysis of the scale and scope of usage and activity of the late and unlamented Google+, I had the opportunity to sample based on profile IDs, which Google had helpfully stashed in a set of robots.txt sitemap files, back in 2015. More recently, when seeking information on the number, size, and activity of G+ Communities (effectively: groups), I could perform a similar sampling based on the group IDs, also provided via sitemaps.

For a basic assessment of how many active users and groups there were, a small sample, as few as 100 or so IDs, selected at random, was sufficient to give a general feel. But there's a lot of variance hidden in 2 billion registered users (as of 2015), or the 8 million Communities existing as of January 2019. And for detailed measurement of the most active users and groups, a very small fraction of the total (0.1% of users, and the top 50 or so of 8 million communities, or 0.000625%), the relative sampling population wasn't the total user or group count, but that small subset, randomly distributed throughout the whole, comprising that sample of interest.

To find the very most active users and groups, in other words, you have to sample a lot of datapoints.
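A back-of-envelope sketch of why, using the fractions mentioned above. The helper names are mine, and the margin-of-error line uses the usual normal approximation for a proportion, which is an assumption on my part rather than something I computed at the time:

```python
import math


def expected_hits(sample_size: int, fraction: float) -> float:
    """Expected number of sampled IDs that fall in the subset of interest."""
    return sample_size * fraction


def prob_at_least_one(sample_size: int, fraction: float) -> float:
    """Chance that a random sample contains at least one member of the subset."""
    return 1 - (1 - fraction) ** sample_size


def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated proportion."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)


TOP_USER_FRACTION = 0.001           # the 0.1% most active users
TOP_COMMUNITY_FRACTION = 50 / 8e6   # top 50 of 8 million communities

for n in (100, 50_000):
    print(n, expected_hits(n, TOP_USER_FRACTION),
          round(prob_at_least_one(n, TOP_USER_FRACTION), 3))

print(expected_hits(12_000, TOP_COMMUNITY_FRACTION))
print(round(margin_of_error(100), 3))
```

A sample of 100 gives a rough overall picture (about plus or minus 10 points on a proportion) but only around a 10% chance of containing even one user from the top 0.1%. 50,000 profiles yields about 50 of them, and 12,000 communities would be expected to include fewer than one of the top 50.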

(Mind: if I'd had log data, they'd have fallen straight out of that. I didn't. Which is itself another lesson: in most cases you're interested in activity and not population as a primary analysis variable.)

Given my tools and methods -- requesting URLs and scraping, from a desktop system over residential broadband -- there were limits to the amount of sampling I could do. 50,000 profiles were doable in a couple of days, but a larger pull would have scaled linearly in time. For Communities, I did a largish pull based on a minimum level of resolution I thought would be useful, based on 12,000 (again, randomly selected) Communities.

In the end I lucked out as a third party was able to provide a comprehensive dataset of all 8 million communities and summary metadata, from which I could validate my earlier sample-based methods.

But yes, working with hundreds or thousands of records, rather than millions or billions, often makes sense, is useful, and requires vastly fewer resources (compute, time, bandwidth).

For getting a rough idea of just
