Google PDF Search: “not for public release” (google.com)
268 points by webmonkeyuk on Sept 1, 2015 | 56 comments

This was one of the most ironic things I found: https://s3.amazonaws.com/reviz-tutorials/The_Pirates_Code.pd...

Trying "Index of /backup" gave me a heart attack.

Holy shit... you weren't kidding. The first result is a BANK for Christ's sake... man.

Looks like some MS Access databases, and wow (for the bank)

edit: clarification

Today I learned that the Windows version of Perl is somehow associated with criminal behavior on the high seas.

There's a whole collection of this kind of search engine query at Hackers for Charity:


"for official use only" or "U//FOUO" brings up interesting results. The PDF "U//FOUO Sovereign Citizens Extremist Ideology" by the FBI was a good read, as were all the recent Interpol internal reports about weapons of theirs that have been "misplaced" or stolen.

Sometimes people are ignorant, but sometimes they are clever.

There are even more "Top Secret" documents.


(The above is sarcasm).

One of them was. It was just from 1960 and probably declassified.

Offtopic but am I the only one seeing this? http://i.imgur.com/3pnOXot.png I thought google moved away from that black bar.

I'm seeing it occasionally.

I seem to remember that they were using it to punish out-of-date browsers. But since I'm getting it with up-to-date Chrome, it doesn't seem to be very well targeted.

A/B testing.

Much more interesting if you limit the search by time.

Less than a month old? A single screenful, mostly Australian.

A lot of these are redacted and appear to be FOIA or similar requests that have been fulfilled.

And then again, after looking, some are clearly not.

This one was quite sad. The suicide of an inmate: http://www.drc.ohio.gov/public/after_action_castroA643371.pd...

Uh. Unless I'm mistaken, that particular inmate pleaded guilty and was convicted of 937 various counts raised against him, including murder and rape. He kidnapped three women (in 2002, 2003 and 2004) and kept them imprisoned in his basement for nearly 11 years during which time he did horrible, unspeakable things to them.

morality is subjective

Alright, I'll bite. People like this guy, serial killers, malignant sociopaths operate outside of society's morality borders that you're talking about. How can we possibly evaluate them within them?

I hope the cops tell you that next time you call them.

Isn't this one supposed to be public information anyways?: http://www.oema.us/files/FBI-OFFICES.pdf

Hmm, it doesn't appear so. The header says, "NOT FOR PUBLIC RELEASE - PUBLIC SAFETY AGENCY USE ONLY"

I wonder if those are the sort of links that can leave you on the wrong side of the Computer Fraud and Abuse act...

IANAL, but typically CFAA violations revolve around crafting special URLs, as in a forced browsing attack. Simply following a URL is, AFAIK, not (yet) a crime.

Intent is key, not the technical approach. If you're intentionally trying to access files you clearly aren't intended to be accessing, you're probably guilty of unauthorized access.

Now subtract -"not for public release until"

Not quite the same, but the results for https://www.google.com/search?q=Hyperlinking+to+the+Site+fro... are interesting.

Priceless: https://www.amherst.edu/system/files/media/1349/Finalproj_s1...

Kind of makes me want to take Geology there, sounds like a fun place.

Assuming filetype:docx is even worse?

The "not for public release" portion of that document (pp. 72-75) is not included in the PDF.

Tennessee execution procedures? lovely

This is pretty interesting. One did say "Not for Public Release UNTIL", so could presumably be intended, but in a lot of cases webmasters probably didn't think something would be found and indexed by Google wherever they put it. And were wrong.

This is a great example of the house of cards all our network systems are built on top of.

Imagine this scenario: you maintain a network of web servers, database servers, file servers, etc. They all combine to generate a large website used by tens of millions of users every month. One day you are just doing a cursory look over a certain server, but you see something strange. Someone is logged in to your server. And they have a Russian IP address.

What do you do? Obviously, the first step is you login to your edge routers and null route all of Russia. GFTO. Next, you've got an idle session on one server. What were they doing?

How can you reconstruct what they were doing? bash history? maybe. Network forensics? Your network probably isn't recording every historical connection between servers—99.9999% of the time useless—but critical in this case. File system access? Your file system probably isn't logging every historical access—useless 99.99999% of the time—but would be really freaking useful in this case.

So, you investigate their history, double-check some database logs, check netstat, check lsof, and in the end you really have no idea what they were doing at all. Our systems don't leave enough bread crumbs around to reconstruct even interior hostile activity, much less to semi-intelligently stop Google from indexing confidential information that's accidentally left exposed.
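For what it's worth, the kind of historical file-access and command logging lamented above does exist on Linux via auditd; a minimal rules sketch (the watched path and key names here are examples, not anything from this thread):

```
# /etc/audit/rules.d/webroot.rules  (path and key names are examples)

# Record every read/write/attribute change under the web root:
-w /var/www/html -p rwa -k webroot-access

# Record every command executed, to help reconstruct an intruder's session:
-a always,exit -F arch=b64 -S execve -k exec-log
```

Matching events can be pulled out later with `ausearch -k webroot-access`. The tradeoff is exactly the one described above: useless 99.99% of the time, while costing disk and I/O the whole while.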

WRT detecting Google indexing your content, it's actually trivial. Web server logs will clearly show Google's web spider(s), and if you want you can set up some monitoring (lots of methods here, all the way up from a cron job running a grep).
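A minimal sketch of that "cron job running a grep" approach, assuming a combined-format access log (the log contents below are fabricated examples so the sketch is self-contained):

```shell
# Sketch: spot Googlebot hits in a web server access log.
LOG=./access.log

# Two fake example entries standing in for a real log:
cat > "$LOG" <<'EOF'
66.249.66.1 - - [01/Sep/2015:10:00:00 +0000] "GET /backup/secret.pdf HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
203.0.113.9 - - [01/Sep/2015:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"
EOF

# The monitoring itself: count Googlebot requests that fetched a PDF.
# (In combined log format the request URL comes before the user agent.)
grep -c '\.pdf.*Googlebot' "$LOG"

# In cron you'd point this at the real log and mail yourself the hits,
# e.g. (hypothetical paths and address):
#   0 * * * * grep 'Googlebot' /var/log/nginx/access.log | grep '\.pdf' | mail -s 'Googlebot fetched a PDF' admin@example.com
```

This only tells you about well-labeled crawlers after the fact, of course; it does nothing about the permissions being wrong in the first place.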

I can't remember the quote exactly, but if you're reacting to a breach it's too late.

Obviously this case is detectable, but it's detectable after it happens since permissions weren't correct in the first place.

Who keeps web logs these days? It's all spyware javascript tracking for pretty graph printing.

Plus, any notifications depend on actually instrumenting any monitoring or triggers or processing to even notice your "sensitive" content has been accessed out of context.

(and this is just web stuff. imagine how impossible it is to track who forwards your confidential emails or other internal documents around without your permission.)

> Who keeps web logs these days? It's all spyware javascript tracking for pretty graph printing.

Anyone who needs records of what has been accessed, so larger companies and organisations.

> Plus, any notifications depend on actually instrumenting any monitoring or triggers or processing to even notice your "sensitive" content has been accessed out of context.

Yup. Hence a cron job automatically emailing its result (crude (or simple?) but it would work).

> (and this is just web stuff. imagine how impossible it is to track who forwards your confidential emails or other internal documents around without your permission.)

I don't have to imagine that. This is why DRM exists; document/knowledge management systems should have the ability to allow access to information but not further dissemination. There's still the user education aspect though (and users don't like change...).

Oh, and the insistence on using external services like Dropbox... gah. "But, but, everyone else uses it!"

You are technically right on all counts.

But we live in a new world. A world of BYOD and now, in 2015, Bring-Your-Own-SaaS. Employees put content up on company platforms, on third party platforms, on high heel platforms.

The problem of solving data privacy at a _competent_ level across every organization is intractable with so many "just do whatever you want" vibes in the air.

Now, that obviously doesn't happen everywhere, but it happens everywhere until it doesn't. Biggest offenders are usually non-technical offices: sales using 8 hosted platforms for metrics, email, surveys, project management, job hiring, etc. All impossible to actually control at any sane level outside of 340 UI clicks of the mouse across webby webby land.

tl;dr give up and go live in a cave for the next 30 years until all this gets sorted

the magic of "turn-key solutions"

When you decide to buy something for $x instead of paying someone who knows what they're doing to implement it with proper standards for $5x, it shows up in things like this.

They should at least have set an owner password on these documents. (In practice, owner passwords aren't effective at preventing people from disregarding the limitations you set on a document, but they will at least exclude the documents from indexing, at least by Google.)

I think the bare minimum would be to put them all in one directory and use robots.txt to hide them from Google.

Sure, it's weak, but at least it won't be accessible through Google.
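A minimal robots.txt along those lines (the directory name is a hypothetical example):

```
User-agent: *
Disallow: /confidential/
```

Worth noting this only asks well-behaved crawlers to stay out; the files remain world-readable, and the robots.txt itself publicly advertises exactly where the interesting directory is.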

What's especially crazy about these is that so many have been cached by Google. Anyone can read these docs, and only Google would ever have a record.

Ironically, a lot of the top results now are about this phenomenon. Reminds me of that page that deleted itself when indexed.

Combine with site:[url] for smaller scope. example: "not for public release" filetype:pdf site:house.gov

Why use "filetype" and not "ext"? The results are identical.

If the results are identical, then who cares? If it's just a matter of saving five keystrokes, I wonder if your 60-keystroke comment was a good use of time...

It may return more results with filetype:, since server-generated PDFs don't always have the extension.

Isn't ext just an alias for filetype? http://www.googleguide.com/advanced_operators_reference.html

Then what in the world are you arguing?

"This is an undocumented alias for filetype:" (emphasis added)

Seems like reason enough to avoid ext: and/or be ignorant of it.

Then why use "ext" and not "filetype"?

Because the submitter may not have known about "ext"?

(Thanks, now I do.)

probably because ext is easier to use and Google will therefore probably discontinue it shortly...
