
Ask HN: How to deal with GDPR / cookie notices in the context of a crawler? - mgliwka
To comply with the new European legislation, many websites put a GDPR / cookie
consent notice in front of their content. There are different implementations
of this. Some are implemented only as a modal covering the website or as a bar
at the bottom of the screen (in both cases right next to the original
content). Other implementations redirect the user to a totally different
(sub-)domain, or even hijack the request and show the consent form instead of
the requested content (on the same URL with a 200 status code).

The latter ones present an issue to my crawler. I cannot access the content of
the page without accepting those notices.

Things I'm considering to bypass those notices:

* US IP address (easy to implement, but some websites also display those
notices to US IPs)

* Heuristics to detect those notices and accept them programmatically (takes
some time to implement - while a couple of vendors (e.g. OneTrust) offer
off-the-shelf solutions which are easy to identify and automate, there are
also many custom-made solutions, so the system would need to understand the
concept of a consent form and how to bypass it - some forms only require the
press of the right button, others involve checkboxes/radio buttons). To
collect test data, one solution might be to visit a set of websites once with
a US IP, once with an EU IP and/or with different user agents (browser or
Googlebot).

Do you have any ideas how to approach this problem? Or are you even utilizing
some techniques already and are willing to share them?
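
For the heuristics idea, a minimal Python sketch of the detection half might
look like the following. It only guesses whether a 200 response is a consent
form rather than the requested content; it does not click anything. The
keyword list, the thresholds, and the function names are assumptions made for
illustration, not taken from any vendor, and they would need tuning against
pages fetched once with a US IP and once with an EU IP.

```python
from html.parser import HTMLParser

# Hypothetical keyword list; tune it against your own test data.
CONSENT_KEYWORDS = ("cookie", "consent", "gdpr", "privacy", "partners", "accept")


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the page's visible text with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_like_consent_wall(html, max_words=150, min_hits=3):
    """Guess whether a response body is a consent form instead of content.

    Assumption: a hijacked page is short and dense with consent vocabulary.
    Both thresholds are arbitrary starting points.
    """
    text = visible_text(html).lower()
    hits = sum(text.count(keyword) for keyword in CONSENT_KEYWORDS)
    return len(text.split()) <= max_words and hits >= min_hits
```

A crawler could call `looks_like_consent_wall(body)` on every 200 response and
queue flagged URLs for a second attempt (different IP, stored cookie, or manual
review) instead of indexing the consent text as if it were the page.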
======
jjcm
Side question - how does HN feel about the cookie/gdpr notices in general? I
personally feel that while I like the purpose they have, they just feel like
spam at this point. I kind of expect most websites to use cookies, and if I
didn't want them to I'd probably block them with an extension. As for the GDPR
notices, are these going to be persistent forever? It feels like the web did 5
years ago, except instead of viagra ads I'm getting GDPR and cookie popups on
every site now.

Overall I feel like the intent of these is correct, but the execution is
terrible. I'd much rather have, say, a badge in the address bar of the browser
(similar to the https badge) saying a site was GDPR compliant and used cookies
than a popup everywhere.

~~~
munchbunny
(For background, I was responsible for figuring out GDPR compliance in a past
job, so I've picked through the literal text and a lot of interpretations of
it.)

I think there's a spectrum, but I think the vast majority of the cookie
notices I've encountered have been implemented in a sneaky way that runs
counter to the spirit of the law.

The spirit of the law is that sites should explain how the data is used in a
way that the layperson can understand, and it should be clear to the layperson
that (in most cases) the site is legally obligated to give you a way to say
"no."

As it stands, most GDPR notices give you a choice between "OK" and "more info"
(where the "no" option is hidden) or between "Yes" and a subtle X in the upper
left (because upper right would be too obvious). And they don't tell you that
by clicking "Yes", you are consenting to having your information brokered and
sold to innumerable advertisers.

I think that's a dirty UX trick, and that for the purposes of getting consent
the "no" button actually should be an obvious "no" button.

The reason the badge isn't possible is that GDPR did the right thing to
enforce privacy by default, and all of the sites that want to monetize your
data for advertising have to get your explicit consent. So you get all of
these notices because they have to and want to ask you. Were it not for
advertising, GDPR would have been a pretty peaceful transition with a few
exceptions like "oh yeah we keep crash reports, you're okay with that right?"

~~~
ankit219
Add to that the fact that most sites make it impossible to use the site
without enabling cookies or consenting to them storing cookies.

~~~
munchbunny
GDPR actually draws a reasonable line for this. Using cookies to remember your
login or browser session is fine, and you probably don't even need to ask.
Analytics is a bit trickier, but as long as the data is aggregated and no
"personal data" is collected, you're on the safe side of the gray area.

The problem is that using cookies for advertising _does_ require you to ask,
so how do you conflate the two?

I've seen some sites phrase it like "we use cookies to personalize your
experience". You can interpret that to mean session cookies, but you can also
interpret that to mean marketing.

I hope EU regulators end up actually going after people for doing sneaky crap
like this.

~~~
ankit219
Indirectly the solution/workaround the sites have come up with is saying, "if
you want to use our site, you will have to agree to our cookie settings. Else
don't use the site."

I have cookies disabled by default. Whenever a site does not work without
cookies (news sites, travel sites, and blog sites especially), I open them in
a guest mode. It still feels a bit tedious, but it works for me.

~~~
munchbunny
The irony of course is that GDPR actually forbids this sort of "all or
nothing" deal under specific circumstances. But it's a gray area, so sites
just say "accept how we use cookies."

Oh well, it's still better than before.

------
ndarilek
I have a related question. How do you bypass them in the context of an RSS
reader/podcatcher? I was building a service to parse some podcast feeds into
JSON, and noticed they were failing on NPR podcasts. Pulled up the URL fine on
my laptop, but it failed in Hetzner.

Of course, it failed because it was getting some sort of GDPR page at the
podcast feed URL. I'm wondering if there was some way around this, because
it's not like podcatchers can opt into something via an RSS feed...can they?
I'm pretty sure I passed headers only accepting feed content-types, but even
that wasn't enough.

Sure I can host elsewhere, but I just didn't care enough about the project to
do that. But if there's a way around this, then I might pick it up again.
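
One way to at least notice this failure mode is to check that the body is
really a feed before parsing it, since the consent page comes back where the
feed should be. A minimal Python sketch, assuming the response body is
available as bytes; the function name and the set of accepted root elements
are illustrative choices, not part of any podcatcher library:

```python
import xml.etree.ElementTree as ET

# Root element names (namespace stripped) treated as feeds:
# RSS 2.0 <rss>, Atom <feed>, RSS 1.0 <rdf:RDF>.
FEED_ROOTS = {"rss", "feed", "rdf"}


def is_feed(body):
    """Return True if the body parses as XML with a feed-like root element.

    An HTML consent page either fails to parse as XML or has an <html> root,
    so it is rejected in both cases.
    """
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # Tags look like "{namespace}name" when a namespace is present.
    local_name = root.tag.rsplit("}", 1)[-1].lower()
    return local_name in FEED_ROOTS
```

This does not get past the notice, but it lets the service report "got a
consent page instead of a feed" rather than a confusing parse error.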

~~~
Rjevski
That's a bug on their side. The RSS feed should not require any consent since
it doesn't have tracking (I don't think you can even embed trackers in an RSS
feed). If they want tracking they can link to outside pages in the RSS feed
and put the consent notice on those before displaying the content that
requires tracking.

------
ddebernardy
Manually accept (or reject) the tracking once, and then pass the relevant
cookie as part of your crawler's request.
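
A rough Python sketch of that, using only the standard library. The cookie
name `consent_accepted` and the user agent string are made-up placeholders;
the real cookie name and value differ per site and would have to be copied
from a browser session after accepting (or rejecting) the notice by hand.

```python
import urllib.request


def build_request(url, consent_cookies, user_agent="example-crawler/0.1"):
    """Build a request that replays previously captured consent cookies.

    consent_cookies maps cookie names to values, copied manually from a
    browser after dealing with the site's notice once.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    cookie_header = "; ".join(
        f"{name}={value}" for name, value in consent_cookies.items()
    )
    if cookie_header:
        request.add_header("Cookie", cookie_header)
    return request


# Hypothetical cookie captured by hand for one site.
request = build_request("https://example.com/article", {"consent_accepted": "1"})
# urllib.request.urlopen(request) would then send it with the Cookie header.
```

Since every site names its cookie differently, the crawler would need a small
per-domain table of captured cookies, and those cookies may expire.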

------
eberkund
Why does every website need to create its own UI for this? Whatever happened
to that "Do not track" browser setting? This should be equivalent to rejecting
all of these notices automatically.

------
Scaur
Thanks for asking this question, I'd like to learn about this too.

------
dogma1138
Sounds like a possible use case for Mechanical Turk for those that do a
redirect popup and not just a foreground DOM object while loading the actual
content behind it.

------
highace
Don't use a server or IP based in Europe. Problem solved.

~~~
watwut
If you have customers in the EU or do any business with the EU, you have not
solved the problem. The law applies to users regardless of where the server
those users connect to is located.

~~~
ddebernardy
I believe he meant to not crawl using an EU-based IP address. Which would make
sense if not for the fact that many sites are serving the GDPR notice to all
users - EU or not.

