See also: http://www.apple.com/customer-letter/
'Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard-coded them, so you must run a DNS lookup as described next.'
> News-specific tag definitions
> Yes, if access is not open, else should be omitted
> Possible values include "Subscription" or "Registration", describing the accessibility of the article. If the article is accessible to Google News readers without a registration or subscription, this tag should be omitted.
One of these things is going to happen:
(1) They end this "experiment."
(2) They stop serving Google the full content. (And see their rankings drop accordingly.)
(3) They get delisted for cloaking.
Google has end-to-end control over some users' internet experience, and much of it in other cases. They own:
* hundreds of thousands of servers
* a domain registrar
* ~50% of web browsers in the US
* a code CDN and a font service
* a hand in defining web standards
* hundreds of millions of email accounts
* a CA implementation
* ISP infrastructure
* software development for a large part of the mobile ecosystem
* what you see when you go to search (most search engines copy Google, buy results, or both)
* many of the web beacons and ad targeting
* oh, and the largest collection of video and images in the world.
So when you say "just do what they say or get deindexed" and present it as if that is reasonable (not just you, but the collective you), I just think I must be insane.
I mean, assuming Google is good (I do, mostly) doesn't mean I would let them become the entire internet.
Real question: if Google were to disappear, versus the "too big to fail" banks that would have gone under (where a case could be made for a few certainly failing), which would have the bigger impact today?
Tl;dr: everyone cares about single points of failure except at the macro system level: finance, banking, healthcare, etc.
In my experience, all those requests to api.recaptcha.net get forwarded to "www.google.com"
My experience has been that if a user for whatever reason cannot access the IP du jour for www.google.com (www.google.[cctld] will not suffice), then that user is prevented from using the myriad websites that rely on recaptcha.net.
Now, I could be wrong and maybe there is something I am missing, but in my experience this is a sad state of centralization and reliance by websites on Google. Quite brittle.
While I consider Snowden a proper hero it is almost a certainty that this could happen to a "friendly" entity like google. In that, the NSA likely has some top programmers who could get a job there and compromise something, learn enough info to find a vuln, or pass data out. This is of course making the massive assumption that they aren't already cooperating at a system level either voluntarily or involuntarily.
As you can see, as search deteriorates Google is motivated to (in my opinion benevolently) use any means necessary to continue to fund their larger goals of a connected and automated techno-utopia. However, they will be tempted to leverage what amounts to almost literally 50% of the world's thoughts to build systems to make short-term profit while pressing forward.
Just a few that immediately come to mind:
* using their network to control an entire alt coin ecosystem
* using data trends to trade on global markets
* start a competing business and deindex or penalize a competitor.
* build skynet (kind of joke)
So basically, those scenarios are fairly suboptimal, and I could certainly imagine that several thousand geniuses with knowledge of Google's systems AND the world's data could likely be profitable quickly.
Can you expand on why search is deteriorating? Honest question. I certainly don't see the relevance of search sinking, nor can I see any competitor in the market that could even come close to threatening Google's monopoly on search.
Search is in fact massively expanding as tooling and machine learning capabilities increase due to research and hardware. Similarly, Google, Apache and Elastic have many open source libraries for search, indexing, storage, caching and serving which allow for scalable architecture. Also, beyond the things above like crawlers and Hadoop, Solr, etc., Microsoft and Google have open-sourced JS parsing engines and Node, and the Electron browser, Brave browser and node-webkit are built on technology that leverages this.
So, as someone who is not an information architect or data scientist, it seems like we have an ecosystem where a scaled-down version of Google can be built and trained on a per-user basis and kept completely private.
The solution I have hashed out in more detail elsewhere, but on to the actual question: is search deteriorating?
* My results seem to be worse and I have much less control than before. Anecdotally, it seems as if quotes and boolean ops are respected less.
* Discovery is a huge issue that Google solved well; now we have the opposite conditions but the same problem. There used to be very few sites, and it was hard to know what content was on them. Now there is too much content.
* Without fine-grained control over my search I can't make the distinction between information vs. links. This means needing a date or a well-accepted piece of content/documentation vs. finding some new apps or non-facts. DuckDuckGo is quite good for some things and Google is good for others. Sometimes you may want to eliminate all WordPress sites (many content mills are built on it) or remove Alexa links from your queries if you need to discover something.
* Need to eliminate sites and content I don't want. NOT something like a content filter for porn or whatever; something like:
never return results from %news-websites older than 30 days.
never return content posted %before nov-2014
remove links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] for reputation ranking
decrease links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] by [80%] for reputation rankings
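Rules like these could be expressed as plain predicates over result metadata. A rough sketch of the idea (the field names, tags, and rule set here are invented for illustration, not any real engine's schema):

```python
from datetime import date

# Each result carries metadata the filter rules can act on.
results = [
    {"url": "https://news.example/a", "site_tag": "news", "posted": date(2017, 1, 10), "score": 1.0},
    {"url": "https://blog.example/b", "site_tag": "wordpress", "posted": date(2014, 5, 1), "score": 1.0},
    {"url": "https://docs.example/c", "site_tag": "docs", "posted": date(2015, 3, 3), "score": 1.0},
]

def apply_rules(results, today):
    kept = []
    for r in results:
        # never return news results older than 30 days
        if r["site_tag"] == "news" and (today - r["posted"]).days > 30:
            continue
        # never return content posted before Nov 2014
        if r["posted"] < date(2014, 11, 1):
            continue
        # decrease the reputation of wordpress-tagged sites by 80%
        if r["site_tag"] == "wordpress":
            r = dict(r, score=r["score"] * 0.2)
        kept.append(r)
    return kept
```

The point isn't this particular rule set; it's that hard drop rules and soft re-ranking rules are both trivial to express once the user controls the pipeline.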
* Google provides no versatility in its results.
* Many pieces of well-tested software would make it easy (for the right group of software engineers) to silo crawl data and parse it with a user's own parameters.
There is a way to set up this ecosystem that I have been thinking about, but to conclude:
Google is fucking awesome and really, really good at what they do. The search experience is getting worse in terms of control, but tooling is leagues better. Google sees this and is working on loftier goals internally (I imagine); thus it has split up into a meta-company that will work as an accelerator for growth while capitalizing on some verticals, like the real estate thing they are doing or the delivery service they just announced, to stay profitable in the short term before they can achieve their end goal. Also, advertisement is an unsustainable paradigm for internet growth, for many reasons.
The DOM is super fucking horrible.
The Parsing engine is a great fix for a fucking horrid DOM.
DNS security is fucking horrible.
The next Google will be a browser & an optimization marketplace.
I don't think compiling to WebAssembly makes sense, but I could be totally wrong. I think something like Docker would provide a sandbox that would let people get performance and versatility and sidestep the entire DOM, only-run-JS, apps-vs-content thing. No idea how this works on mobile, though.
> So, as someone who is not an information architect or data scientist, it seems like we have an ecosystem where a scaled down version of google can be built and trained on the per user basis and completely private.
I'm skeptical. Search is a huge problem just because of the bizarre amount of resources you need to throw at it. I can't afford to build my own datacenter(s) to host my custom search system. There might be huge advances ahead in terms of storage capacity on commodity systems, I don't know, but in any case, I'm only one person crawling webpages versus millions of people (and bots!) creating them.
You implicitly address that a bit later by talking about "silo crawling", but again, I'm skeptical. The only silo structure that I can easily see is large sites with useful content like Wikipedia or StackOverflow/StackExchange, but I'm likely to come across these anyway in any given domain, and I can easily filter for these on Google today, e.g. "site:en.wikipedia.org". The more interesting and hard part is the long tail of small, sparsely interconnected websites which might contain unusual insights but are unlikely to turn up with a silo crawler (or with Google's current UI, for that matter).
> Search experience is getting worse in terms of control
I guess that's the classical problem of scaling a product to a large audience of mostly technically illiterate users. Maybe Google is learning from Apple, whose UIs have for a long time favored ease of use over giving control to the user.
I have been thinking about this and have come up with some ideas; other people would obviously provide more, and a solution could be reached. Some of my thinking:
A service that behaves like AWS/GIT/DNS/Google combined.
* A user runs the service and indexes the data it receives, and there is a central repository of information that a user can contribute to or not. Initially, a new user would either buy a crawler or a cache of data from the market and store it locally or on a private server bought as part of the service. A blockchain, or some verification mechanism, would be used to provide access to the initial seed data, and a hash would verify the contents. The user now has a running cache of data s/he can connect to with a private DNS-like verification system. User runs search.
* The parameters do not return the results s/he wanted from their private store. Similar to DNS, they move up the food chain to the service provider (whoever creates this system, or one of the companies/orgs providing the service) to get more data. Here there is a centralized repo of information. This can be a market or platform. People can buy and sell data, filtering mechanisms and crawlers. Also, people can include all of their searches, or some of their search results, in the master crawl. This would be the "datacenter", but it can also be a platform that maps to many people's individual caches:
> So if I indexed and codified everything about the Beatles I could sell this to the market by running my own server.
> I could sell a crawler that is really really good at finding all musicians and music to the market.
> I could sell a filtering/parsing engine plugin for the music guy's crawl results (or all results it is fed) that only delivers high-quality FLAC audio files and converts high-enough-quality MP3s to FLAC, all this but only for tracks with a saxophone.
However, the fucking music guy's crawl stack doesn't have the shit I want in it.
I can buy (or write) a master crawler that goes out onto the internet and finds what I am looking for, then delivers it to my private cache and, if I am generous, codifies it in a generally accepted meta language and inserts it into the master.
Obviously there is much more here but what I am talking about is distributed and optimized search.
Notes: Google has a nearly impossible job:
* It does not allow a user to provide any filtering outside of some boolean operators and human language.
* Therefore it never knows exactly what the user wants.
* Provides a general service so to some extent it is one size fits all.
* Difficult to do machine learning because it can watch you make selections but may not ever be able to tell what the deliverable was or if you were successful.
* You cannot back out or modify the algorithm it uses to find results. Necessarily, even when it knows what you want, it is biased, because it both shows you the results and defines the algorithm. Also, per a fact I am baselessly making up right now, only 2.1% of users ever go to the 3rd page, which means that if Google is wrong, it can't know, and the problem compounds as users see the same bad pages and keep clicking them.
> I guess that's the classical problem of scaling a product to a large audience of mostly technically illiterate users.
Yes. I am not saying Google is doing a bad job. They have a nearly impossible task if they only use a search bar with natural language and zero filtering to deliver trillions of terabytes of data to millions of people. I am not sure how much easier it would be, but certainly n times easier, if filters worked.
Also, I think the idea of HTML is basically shit EXCEPT for the meta language. If not some simple JSON, the actual results need fucking tags, not just the content; then we could filter down further and better.
Bitcoin payment for content/filtering/cooperation
Running arbitrary code in a sandboxed environment like docker, not a "DOM"
Also, the silo concept is like DNS, if I didn't explain it super well. You have a cache on your computer, a cache in the cloud, access to a master cache of information (both retrievals and lookups of other silos) and an optimization market for searching through data, or finding more of it if necessary.
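The DNS-like resolution chain here could be sketched as a fall-through over cache tiers, with answers pulled down into the nearer caches the way a resolver caches lookups. A toy sketch (the tiers are just dicts standing in for the local cache, cloud cache, and master):

```python
class SiloResolver:
    """Resolve a query through increasingly remote caches, DNS-style."""

    def __init__(self, local, cloud, master):
        # Each tier maps query -> results; real tiers would be a disk
        # index, a private server, and the shared master/market.
        self.tiers = [("local", local), ("cloud", cloud), ("master", master)]

    def resolve(self, query):
        for name, tier in self.tiers:
            if query in tier:
                hit = tier[query]
                # Propagate the answer into every nearer cache so the
                # next identical lookup short-circuits at "local".
                for _, nearer in self.tiers:
                    if nearer is tier:
                        break
                    nearer[query] = hit
                return name, hit
        return None, None
```

A miss at every tier is where the "optimization market" would kick in: buy or run a crawler to populate the master.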
100% obvious search will end up this way. Brave Software seems to sort of get this. I am hoping they realize a browser can't be decoupled from search, though, because you can't just fork Electron and put some plugins in it. They are super talented. I am hopeful. One of the systems similar to what I am suggesting is called memex-explorer. However, I have never used it, as the build is currently failing. It was originally funded by DARPA and NASA JPL; then one day all work stopped on it, and I have emailed and tweeted some of the people and orgs with no response. So while doing research, the description seems somewhat in line with my thinking.
The large problem of scaling is handled by the market. Search is essentially an API to call APIs that call an RSS feed if you think about what your browser and google are actually doing. Knowing what those APIs do is pretty fucking important.
I try to share this info with people but they all think I am fucking insane. Does this sound that farfetched? Honest question.
So does it mean that Google will no longer index full WSJ articles, or does it mean a change in Google's policy?
Since this is billed as an "experiment" I'm guessing that WSJ is just testing the waters. If they roll it out to everyone, they will have to serve only snippets to Google or risk getting delisted.
From the ABA: "Exceeds authorized access" is defined in the Computer Fraud and Abuse Act (CFAA) to mean "to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter."
To prove you have committed this terrible felony, the FBI will now demand that Apple assist in disabling the secure enclave of your device in order to access your browser history. But remember, they only need to do this because they aren't allowed to MITM all TLS and "acquire" -- not "collect" -- every HTTP request your machine ever makes. </s>
Which doesn't matter, because the WSJ is never going to sue you. But make sure you only consider your justification a personal one, not one that would provide any legal protection.
All I want to say is: scrap the CFAA now!
It is not specifically defined in the law, so it reverts to the traditional meaning: anything the owner of the system says you aren't authorized to access.
It's lunacy, I know. That's what HTTP headers and WAFs and such are for. But that's the stupid law, and it sent someone who used to be my friend to federal prison for changing a user agent and referrer and accessing unprotected data on the web.
Knowingly and with intent to defraud, accesses a
protected computer without authorization, or exceeds
authorized access, and by means of such conduct
furthers the intended fraud and obtains anything of value,
The mismatch between the world views of jurisprudence and engineers is a neverending source of joy. (If working in tech has made you cynical like me, that is.)
- to access a computer with authorization and to use such access
to obtain or alter information in the computer that the accesser
is not entitled so to obtain or alter.
Again, exceeding authorized access means using your authorized access to obtain information you were not "entitled" to. So the question is not 'were you authorized' but rather it is 'were you entitled' to that information? WTF 'entitled' means is another question entirely, but likely it is in the eye of the beholder. A jury decided Weev was not 'entitled' to the email addresses he downloaded from AT&T, and it's safe to assume we are not 'entitled' to free access to WSJ's content. So I would not rest your hopes on the "200 OK".
So, your example is...not an example?
>WTF 'entitled' means is another question entirely
No, in this case it is very clear: a request containing a particular user agent string is entitled. I have not tried this myself, but presumably you could verify that is the case by sending a request with the appropriate user agent.
Again I think you're confusing the fact someone could trick the server into delivering the content for free with WSJ intending to deliver their content to you for free. Since WSJ clearly intends their content to be delivered to only Googlebot for free and to users only if they pay, it is likely a jury would consider this a violation of CFAA.
A web server returning 200 OK is not ipso facto a guarantee that the person making the request is not committing a crime. For a more obvious example, consider a request whose header contains a stolen authorization token. The law does not require the access control to be non-trivial to defeat.
I don't like it, and I think the CFAA is seriously problematic, but it is the law and the Feds have been known to enforce it.
It is not at all clear that WSJ intends Googlebot to get their content for free while others must pay. This is actually against Google's policies, which would actually call into question whether WSJ's behavior is felonious. WSJ may not be entitled to be incorporated into Google's index, yet they are manipulating the Googlebot to the contrary.
At least in the US, the law doesn't work that way. Decisions will quite often cite some other similar case which reached the opposite conclusion, but under different circumstances, because that other case's decision says something like "X, if it weren't for Y" or "Fortunately for the defendant, they didn't Z, so not X", or something. That isn't binding precedent for the judge to apply X, but it's a very strong sign that X would be reasonable.
A court case that says "Yes, this violates CFAA but we have to throw out the case because A, B, and C" is very strong reason to believe that, if the next prosecutors avoid A, B, and C, the next judge will say "Yes, this still violates CFAA."
(IANAL but I read court cases because I find it useful to understand my jurisdiction's legal system.)
> a request containing a particular user agent string is entitled.
The phrasing of the law is very clear that the word "entitled" applies to a person, not to a request. Stealing someone's password and using their account is definitely a violation of CFAA (see e.g. http://www.wiggin.com/16332). In such a case, the account used to log in is quite plainly "entitled" / "authorized;" that's how you get the data. But the person logging in is not "entitled".
The other decision was vacated. The jury was not appropriate and their decision is irrelevant.
Unfortunately, when the prosecutors come, they will only care about the legal process.
The legal world cares about how the law applies to the facts of the case, not about how common sense applies.
Not saying I like it.
However, the set of people "authorized" is not the same, at least not from a legal perspective. This is what the case law says: the set of people who technically _can_ access the data is different from the set of people legally authorized to access the data.
That might not be what the engineers who designed the system, run it, and produce the content intended, but that is what the law says.
It's a bummer the two disagree. But only one of the two systems put you in jail if you cross them.
You and I may wish it were otherwise, but wishing isn't going to make it so.
> Likewise, implementations are encouraged not to use the product tokens of other implementations in order to declare compatibility with them, as this circumvents the purpose of the field. If a user agent masquerades as a different user agent, recipients can assume that the user intentionally desires to see responses tailored for that identified user agent, even if they might not work as well for the actual user agent being used.
That sure sounds like impersonating other user agents is allowed, but not encouraged. That is a clear distinction from being malformed.
Obtaining paywall-protected content by faking your user agent to purport yourself to be a Google Crawler is quite clearly fraudulent. This isn't a point for debate.
PS. To play along with the linguistic theme, can you provide a source for the definition of a malformed request? My original intent when using the word malformed was not to invoke its technical definition but rather its dictionary definition. But, having said that, I just had a 30-second Google hunt and couldn't find anything to corroborate your position.
This line of argumentation is beyond ridiculous.
If you meant to put forth that faking a user agent is a technique to exceed authorization, that's fine, and I'm glad to have helped you clarify it. Just be clear, it's not what you said with your detour into malformed requests.
That is demonstrably false. I, personally, can consider any request malformed unless it starts with the letter W. Regardless of what the standards say, I can think whatever I want to.
Similarly, the law can make up whatever rules IT wants to about the definition of "malformed". In that case, it pays some scant attention to things like standards, but mostly cares about "to the random guy-on-the-street (jury member or judge) did this seem like stealing". And there, I am afraid you lose.
I could say "the sky is blue" and you could say "demonstrably false; the sky is personally red because I'm wearing colour-altering sunglasses".
What he meant is obviously "a request can only be considered malformed (as defined by the commonly accepted definition of malformed) .."
Your argument is fallacious. The courts will use common definitions of terms and have regard for context.
However, what you're arguing about is even more pointless. The CFAA doesn't depend on the term "malformed" in any way, but on the term "authorized access".
Your first sentence is not similar at all to your second; they're demonstrating first a blatant disregard for common definitions of terms, followed by praise of such common definitions.
If you don't get it: the court doesn't care about reasonableness, technical correctness, etc. It only matters whether your lawyers can convince the jury/judge. The Twinkie defense, "if the gloves don't fit"... and so forth.
"Here is a transcript of an electronic forum detailing how to circumvent access controls and defraud the victim using the exact methods the defendant used to access the victim's website. A forum the defendant heavily traffics, often multiple times per day."
"Ladies and gentlemen, I ask you: the defendant, a self-professed developer and an expert in these circumvention methods, who regularly participates on forums discussing hacking and defrauding companies such as the victim. I ask you: is it reasonable to believe he 'just forgot'?"
Lawyers, man, lawyers! Can you not understand that rationality and technicality don't matter? Lawyering is like statistics/graphs: you can get the data to say whatever you want.
general.useragent.override.netflix.com;Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:42.0) Gecko/20100101 Firefox/42.0
in about:config... what were they thinking?
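For context on why that pref can exist at all: the User-Agent is nothing more than a client-supplied request header, and any HTTP client can set it to any string. A stdlib sketch that builds (but does not send) a request with Firefox's UA string from the pref above:

```python
from urllib.request import Request

# The same Firefox 42 UA string shown in the about:config pref.
ua = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:42.0) Gecko/20100101 Firefox/42.0"

# Nothing is sent here; we only construct the request object to show
# that the UA is just an ordinary header the client chooses.
req = Request("https://example.com/", headers={"User-Agent": ua})
```

Note that urllib normalizes stored header names (hence "User-agent" when reading it back); the server still just sees whatever bytes the client chose to send.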
Allow by IP range? You can probably find a somewhat accurate range for Google and whoever's crawlers.
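Checking a request IP against a published CIDR block is a one-liner with the stdlib `ipaddress` module. A sketch (the range below is an example block, not an authoritative Google list; a real deployment would load a maintained list, since these ranges change):

```python
import ipaddress

# Example crawler range to allow; treat this as a placeholder, not a
# canonical Google block.
ALLOWED_NETS = [ipaddress.ip_network("66.249.64.0/19")]

def ip_allowed(ip):
    """Return True if the IP falls inside any whitelisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in ALLOWED_NETS)
```

Compared with the reverse-plus-forward DNS check, this is cheaper per request but brittle, since hard-coded ranges go stale.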
Google was fortunate that no one sued them for these things before they got big enough to defend themselves. Many tech entrepreneurs haven't been so lucky.
You may ask why a big company like Google isn't doing more to change the CFAA or copyright law. The reason is now that they're big enough, legal grey areas like those in the CFAA (particularly, "what is unauthorized access?", because it's not defined by the statute) can be fully exploited, and Google can sit secure in the knowledge that they'll never be realistically challenged on it; meanwhile, they can then threaten potential competitors for doing the same thing, since a lawsuit against a public corporation takes 10 years and $5MM-$20MM. Anyone who could mount that kind of offense against Google won't, because they benefit from the grey area too; they'll just make some backroom deal with Google and not lose their lucrative, competition-destroying ability to do things that companies with sub-$100M revenues aren't able to do.
From the point of view of the law, that might not matter, but when there's a standardized way to make clear to bots that they're not welcome and you didn't bother to implement it, you'll look pretty silly if you complain.
If someone hasn't exploited a security bug (I mean a real security bug, like a buffer overflow, I don't consider behaviour such as serving up content to certain User-Agents only a genuine "security bug"), and they haven't bruteforced/cracked/acquired a password or private key, and they aren't sending unreasonable amounts of traffic ((D)DOS), it should not be a crime, and the law should be changed to reflect that principle. The law should reflect the common sense of the technically literate, but it doesn't, because it was written by the technically illiterate.
This is the spot we've reached through legislative meddling, and your best bet is either to be a good little consumer and don't make waves or do what you want and don't get caught. Neither of those seem to make a lot of sense in the long run.
So yes, congrats. It's jail for you -- but not for Google. Because they're Google, and you, well, you're not. You'd just better be happy we don't find out about you rooting your cellphone last year. Good grief.
ADD: I know you meant well, and I appreciated the </s> tag, but there was something I didn't like about your comment. Now I know what it is. By making this a big deal, you're increasing the likelihood that this poor schmuck becomes the next "example" some federal prosecutor decides to make. It's not your fault, but it still sucks. Let's hope that doesn't happen.
Edit: I was mis-remembering, the current law is against possession or manufacture of eavesdropping or wiretapping devices, not hacking tools. The EU has been playing with laws against hacking tools, but apparently nothing in the US yet against it.
The law makes it illegal to distribute devices (incl. software) whose design renders them primarily useful for the purpose of the surreptitious interception of wire, oral, or electronic communications. Punishable by not more than 5 years and/or not more than $250,000. 18 U.S.C. 2512.
I don't think this blog post qualifies as an "interception" device... however, unauthorized retrieval and recording of another's voice mail messages constitutes an "interception", so who the hell knows. I'm sure you could find a US DA who would argue the falsified User-Agent meant the software is designed to "intercept" communication meant only for Google.
While §202a and §202b punish unauthorized access to or interception of data, §202c extends this threat to obtaining passwords or creating tools in preparation of such an act.
Note that this section has only existed for a few years. I am not aware if white-hat hackers have actually been prosecuted for creating hacker tools.
And by "may", I do mean "may". I don't know. But it's at least possible.
No person shall manufacture, import, offer to the public, provide, or otherwise traffic
in any technology, product, service, device, component, or part thereof, that—
(A) is primarily designed or produced for the purpose of circumventing a technological
measure that effectively controls access to a work protected under this title;
(B) has only limited commercially significant purpose or use other than to circumvent a
technological measure that effectively controls access to a work protected under
this title; or
(C) is marketed by that person or another acting in concert with that person with that
person’s knowledge for use in circumventing a technological measure that effectively
controls access to a work protected under this title.
The wording reminds me of a similar section in the German Copyright Law which outlaws the circumvention of "effective" copy-protection schemes.
How an access control scheme can be effective and circumventable at the same time is completely beyond me. :)
Here's how HTTP(S) works: I issue a REQUEST to the web server; the web server RESPONDS to it, or denies it. It is up to the web server to respond or deny or do whatever it wants. If the web server is badly implemented or doesn't know what it's doing, that is the web server's fault.
Remember: it's just a request. I can request 100 dollars from you; the fact that you give them to me does not make me a mugger.
So you go up to a server and lie to it, and it gives you something; is that not acquiring things through deception?
The structure of your argument suggests that e.g. breaking into an ssh server by issuing a login request with a known password which it responds to, isn't illegal. And further, that if data is acquired from the server, there is still no crime - the ssh protocol too is just requests to the server, it's all bits down the line. It's clearly nonsense.
Whether the lock was implemented poorly or you just didn't lock it — doesn't matter.
"If a user agent masquerades as a different user agent, recipients can assume that the user intentionally desires to see responses tailored for that identified user agent, even if they might not work as well for the actual user agent being used"
It's a param exactly equivalent to asking "who would you like to be treated as?" If I say "the pope!" and you respond by kissing my ring, I have not defrauded you.
You're talking about a sign with practically no legal meaning.
This discussion is about a law which does have legal meaning.
To fix your analogy, it would be "If there's a law that says only explicitly authorized cars are allowed in a car wash; else 5 years in prison, and I have a car wash and say 'only red cars' ...". Of course, that still doesn't properly capture this since the intent of the law also matters and that law would be senseless, and so interpretation would be less obvious.
The better analogy is the one below about breaking into homes being illegal, but what if you happen to have a key that fits the lock? (though that also is a bad analogy in its own way).
Basically, making physical analogies for technical matters is rarely correct. It's often the best way to convince non-technical people of a matter without them needing to actually understand it.
If there's anything the tech sector should be able to come to agreement on, it's that lock metaphors in a situation with absolutely no hidden knowledge or private tokens make us all dumber.
The door is locked, the key is right there and all you have to do is pick it up and you can gain access, but that doesn't make you authorised to gain access.
The CFAA and UK CMA may well be overzealous in restricting what should (to technologists) be allowed, but it's what we have as statute in the respective jurisdictions.
The law says "authorized access" and the WSJ authorized Google to access their content in order to index it. The WSJ did not authorize that content to be presented, for free, to you, an end user, necessarily. I don't know which way a court would rule on it, but it's definitely not black-and-white.
Sure, it's technically similar, but the court doesn't care. The court doesn't care if the law makes no technical sense, because it's a law, not a program.
In other words, yes, the WSJ does intend to only give Google access to their content, and not the general public.
But no, the WSJ has not "authorized" Google by anything more official than a bank telling their security guards to let anyone into the vault who is wearing a blue t-shirt.
So yeah, I agree with you, there is a lot of conflation of technical means and the law, but we also shouldn't be granting to the WSJ that they are doing any real "authorizing" here, beyond wishing it and hoping it stays true.
(See the second precept here: https://en.wikipedia.org/wiki/Five_Precepts)
Morally, I'm not sure.
1. If they allowed all bots but disallowed all non-bots, that would raise questions of what defines a "bot". Couldn't I write a personal bot that fetches the story for me? As a browser addon, even?
2. It's even more complex since allowing bots means they allow tools that provide the information to third parties, as the bots are not intended for private use by the bot maker. So the door is already open.
3. But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot? Try to use the "web link" trick from HN on any other search engine, and it doesn't work in my experience. That seems anti-competitive and discriminatory in favor of the existing dominant entity in this space, Google.
Maybe, but I think it's a pretty easy distinction. They aren't even allowing all bots; they're allowing a whitelist of them. You're not just writing your own bot to get around it, you're pretending to be someone else's bot.
> But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot?
That's the really important question. I personally have no context for answering except to say that I can see both sides argued. If you view their website as a physical store / private establishment, then I assume that they have every right to establish who has access to what and under what conditions.
Of course, that hampers a lot of legitimate use cases along the way.
Not true. There are very specific laws about not being able to discriminate against protected classes of people.
Bars have to serve minorities, bakeries have to cater to same sex marriages, etc.
Where you draw the line of legislated equality within private property rights is pretty intriguing. I don't have any answers, but lean heavily towards the libertarian bent.
The only loss is the energy/bandwidth/cycles WSJ servers spent answering your request. Which, I believe, has been the basis of computer "fraud" cases.
This can't be true. Surely the argument for why, say, a WSJ-paywall-bypassing-tool causes damage (in the legal sense) to WSJ is that it allows people who would otherwise pay for content to get it for free, thus depriving WSJ of income.
Moreover, I don't think prosecutors need to prove that you caused harm in order to charge you with computer fraud, since, for example, CFAA falls under criminal law.
But the sole existence of this trick means the person who would use it is exactly someone who would not pay, so your argument does not hold. And it is not stealing; it is more like listening to an outdoor rock concert from outside the fence because you don't want to pay. Inconvenient, sure, so plenty of people would still pay.
Cloaking is against Google's rules, so it is WSJ that's being dishonest.
Prosecutors never (want to) charge just one thing. They want a laundry list of a dozen or more crimes so they can coerce the suspect into pleading guilty. The "theft of resources" charge would just add to the pile.
More importantly, is violating "Google's rules" suddenly a violation of law?
IANAL and this is most likely wrong, but kind of plausible to my NAL mind.
If they are advertising incorrectly, they should fix that.
I'm not taking anything; I am simply absorbing information. That information will still be there when I am done reading it. Have I really stolen, or did I just refuse to give someone money on demand?
What people choose to do with the information is another story...
I think the relevant point, underscored by the author's last sentence, is it doesn't matter who you open a back door for - it opens the possibility for anyone to barge through.
Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles). They want me to pay, and they want me to see ads, and they want to track my behavior? Should I send them my DNA also?
Organizations like WSJ are exactly the disease that causes ad blockers to proliferate and ruin the web for all the decent publishers. They're at war with my privacy (by breaking their site intentionally when I visit with a blocker on). They want it all, ads, tracking, your private data, and subscription revenue, not to mention...
# Agenda-Driven Content
I mean, we're basically talking about NBC or Fox here, just on the web. Imagine every morning when you woke up you turned on the television and tuned to some "news" show. After talking about the weather, they start talking about a lost pickle that is thought to be potentially alive and moving about with free will. Over the next two years, talk about the same pickle extends to every other TV show. Before you know it, everybody in the nation is talking about the same pickle. Years go by, and that pickle has become a part of our society, and that's not because people are born with an innate care for the well-being of pickles, but because "news" shows taught them to care.
That's not a good position to be in. I have to believe I'm not the only one in here that doesn't watch any TV. So, why do we all treat the same media giants differently on the web? We crave their content so much that we build browser add-ons to get to their content, etc.
You aren't entitled to WSJ.com, NBC, or Fox.
I don't actually see what you're referring to; maybe it's because I get redirected to http://www.wsj.com/europe, or maybe I have a different ad-blocker. Either way, it reminds me somewhat of NME's homepage (New Musical Express, a popular music publication; not sure if it's really known outside of the UK). They deliver their images in such a way that they fall foul of my ad-blocker, although I haven't looked in enough depth to be certain whether this is a way of preventing ad-blockers or purely unintentional.
Uh, what? Using uBlock origin, when I visit wsj.com I get what looks like a perfectly normal page. Nothing is scrambled at all.
EDIT: The "paste a headline into Google" trick still works for me, though. If this continues to be the case, they will keep indexing, of course.
So people can find it? I'd be pissed if Google de-indexed something like IEEE because it has a paywall.
Assuming the internet has to be freely available is a mistake. Especially with the continued growth of an adblocked internet. We could be facing an internet with significant paywalls in the future.
I'd support a "free" search term to weed out paywalled results.
Furthermore, Google shouldn't be making normative judgements about what people should see. It's an abuse of their monopoly.
WSJ is free to institute a full paywall and only serve snippets to Google. They might not like what it does to their rankings, though.
What they cannot do is continue to sniff the UA before deciding to put up the paywall. (Though I'm still able to use the Google trick, so it seems the experiment might have ended.)
And yes, this violates Google's policies laid out explicitly at https://support.google.com/news/publisher/answer/40543?hl=en
Additionally, I would like to point out that I wrote a Varnish extension for the express purpose of validating User-Agent strings through DNS lookups; it is available here: https://github.com/knq/libvmod-dns
It was built because we had a specific problem with bad bots crawling a large site (multiply.com), and this was one of the easiest ways to filter the bad bots from the good and to enforce robots.txt policies on a per-bot basis. It works very well, as you can do any kind of DNS caching internally and prevent this kind of behavior, if that's your goal.
That being said I do enjoy their content, save for maybe the op-eds.
They all seem to want to sell subscriptions, which are perpetual and probably difficult to cancel.
The pricing here is much too aggressive
Playing devil's advocate here. Pricing for many online goods is almost completely arbitrary and varies with little accord to service/product quality or even what that service provides.
Another related example of arbitrary pricing: people will pay $2 for a soda from a vending machine but won't pay $1 for a useful app on their phone.
There's something going on there... The day the $1 apps figure out what makes people buy $2 sodas is the day they become rich. And the sooner content providers figure out why people will pay $20/month to stream media (say, $10 to Netflix and $10 to Spotify) and charge people $20/mo for their articles... things will turn around for them.
Because news operates on a different timescale than movies. People are still reading 40-year-old books.
A 40-year-old news article is not relevant anymore because it only fits within the momentary context in which it was created, whereas a non-fiction book or even an essay can span a broader context and thus stay as informative for future readers.
I don't know if it's available in the US yet but they are at least planning to launch in the near future.
Are they running afoul of Google policies and going to get pinged by Google?
I can't find the text from Google now (when can you ever find any docs at google?), but I am very certain I remember reading from them that you may not return different content to GoogleBot based on User-Agent.
Otherwise, why would expertsexchange be obligated to provide the answers at the very bottom? Did something change?
Those signatures could obviously leak, but on a per-domain basis. Perhaps the domains could have a secure way of bumping the valid key generation if they had a leak.
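One way such per-domain signing could work: the crawler signs each request with an HMAC over the URL plus a key-generation number, and after a leak the site simply bumps the current generation so old signatures stop verifying. A hypothetical sketch in Python (the scheme, key names, and values are all illustrative, not anything Google actually does):

```python
import hmac
import hashlib

# Hypothetical per-domain secrets, keyed by generation number so a
# leaked key can be retired by bumping the current generation.
DOMAIN_KEYS = {1: b"old-leaked-key", 2: b"current-secret-key"}
CURRENT_GENERATION = 2

def sign_request(url, generation, key):
    """Sign a crawl request: HMAC-SHA256 over generation and URL."""
    msg = f"{generation}:{url}".encode()
    return hmac.new(key, msg, hashlib.sha256).hexdigest()

def verify_request(url, generation, signature):
    """Accept only signatures made with the current key generation."""
    if generation != CURRENT_GENERATION:
        return False
    expected = sign_request(url, generation, DOMAIN_KEYS[generation])
    return hmac.compare_digest(expected, signature)
```

With this shape, a leaked generation-1 key is useless the moment the site moves `CURRENT_GENERATION` to 2, without any coordination beyond the crawler fetching the new key.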
First, they don't want to. In fact, if a search engine can figure out that a link is going to lead to a paywall, they'll probably want to reduce the ranking of the result, because the user is not going to want results they can't actually look at.
Second, it would be a massive antitrust violation because it would prevent access by competing crawlers. The only way around that is to allow access to anyone who claims they're a crawler, which was the original problem.
AFAIK no content provider actually does this check though.
Also, isn't it illegal to bypass computer security?
Their server can choose to do what it wants with your request and you can choose what to do with the response it sends.
Are User-Agent headers legally protected identities?
Then again, lots of sci-fi dystopias are dreams of an automated law that somehow destroys the fabric of society, so...
In a related point, some news sites load a modal over an article and prevent scrolling, asking you to sign up. However, if the full article is included in the response and I read it by simply viewing the response body (HTML), is that circumventing security? This is an actual example: by modifying how the response is rendered in my browser, I can bypass their intentions.
Obviously the better way of doing this would be to not send the entire article content until they've determined I should be able to view it.
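To illustrate the point: when the sign-up modal is added client-side by a script, the article text is already sitting in the HTML the server sent, and a trivial parser recovers it. A sketch using only the Python standard library (the URL is a placeholder; real pages would need more careful extraction):

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text content, skipping the scripts/styles that build the modal."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1
    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def article_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

# If the full article ships in the response and a script merely overlays
# a modal, the text is already here:
# html = urllib.request.urlopen("https://example.com/article").read().decode()
# print(article_text(html))
```

Which is exactly why sending only a teaser until access is verified server-side is the better design.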
I think so. Just changing the User-Agent alone does not mean fraud is happening. User-Agents get changed often for valid reasons: research, detection of cloaking, testing, etc.
> However if the full article is included in the response and I read it by simply viewing the response body (HTML) is that circumventing security?
Unfortunately, it seems legal systems allow companies to restrict what you do with things they produce, even if you do it on your computer in private, so legally, this may not be successful in court.
It would be completely idiotic if there were a string you could put in a Mozilla browser config that is literally illegal to browse the web with.
I actually made a bookmarklet with the following pasted into the URL, so you can do it in a single click:
I've used them to save Facebook posts before, and the pages were logged in to some "Nathan" IIRC. They probably have a bunch of hacks for specific sites that needed fixing.
I then pasted the headline into google and clicked on it from Google results and did not get hit by the paywall.
I thought that google deemed providing search results which were behind paywalls as a "bad experience" for their search users, and would penalize websites for doing so.
Is this no longer the case?
For the second point, Google does require that publishers specify "registration required" in their sitemap.
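Per the Google News publisher docs quoted elsewhere in this thread, this is declared with the `<news:access>` tag in the News sitemap, whose value is "Subscription" or "Registration" and which is omitted entirely for freely accessible articles. A hedged illustration (URL, publication name, and dates are placeholders):

```xml
<url>
  <loc>https://example.com/article.html</loc>
  <news:news>
    <news:publication>
      <news:name>Example Journal</news:name>
      <news:language>en</news:language>
    </news:publication>
    <news:access>Subscription</news:access>
    <news:publication_date>2016-02-01</news:publication_date>
    <news:title>Example headline</news:title>
  </news:news>
</url>
```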
Not an SEO expert here, but I wonder how, and whether, Google will end up handling that. Making an exception could also be considered an abuse of power in some countries of the world. I don't have any strong opinion on it yet; I'm just saying that because of how the EU has exercised certain laws in recent years.
User-agents are notoriously unreliable.
Content providers register for a (yet-to-be-written) Google News API account and get an API key, which Google uses when indexing the site and which the site recognizes as legitimate.
Great idea here guys
Also, Google has published IP addresses it uses, so this extension might not last long...
They do not, but you can find out by doing a reverse DNS query.
> "Google doesn't post a public list of IP addresses for webmasters to whitelist"
2. The idea that this is somehow new is wrong. The way for a server to identify crawlers has "always" been to look at the user-agent and, when done right, the IP, verified either by net block owner or by doing a PTR lookup and then checking that the A or AAAA record for the claimed host points back at the same IPv4 or IPv6 address. Meanwhile, I do agree that paywalling is a more recent phenomenon, at least with regard to the extent it is popular among sites today. But the concept of presenting different data to crawlers and visitors arose much earlier, and it is something Google has been aware of and has made sure to delist sites for when found. In fact, Google has since moved a bit in the direction of allowing it, in that they do so for Google News if declared, as explained by others ITT.
So in my view, it seems that the author is jumping to incorrect conclusions based on an incomplete understanding of what's actually going on here. What then about the HN readership, how come this article became so highly voted and I don't see these issues raised by anyone else? Or maybe I'm just crazy?
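The PTR-then-forward check described above fits in a few lines of Python (stdlib only; the hostname suffixes are the ones Google documents for its crawlers):

```python
import socket

# Reverse-DNS hostname suffixes Google documents for its crawlers.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")

def is_real_googlebot(ip):
    """Verify a claimed Googlebot IP: PTR lookup, suffix check, then
    confirm the forward (A/AAAA) record maps back to the same IP."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)  # PTR lookup
    except OSError:
        return False
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        forward_ips = {info[4][0] for info in socket.getaddrinfo(host, None)}
    except OSError:
        return False
    return ip in forward_ips  # forward-confirmed reverse DNS
```

Any request claiming the Googlebot UA from an IP that fails this check can be treated as an ordinary visitor, which is exactly how a site closes the hole the article describes.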
Don't nitpick. It's just a shortened version of How To "Be" a Google’s Web Crawler to Bypass Paywalls. You get it. I get it. Everyone gets it.