Hacker News

I published a tiny script that makes mass grabbing of files from Github easy (https://github.com/cortesi/ghrabber), and wrote about some of the interesting things one could find with it. For example, there are hundreds of complete browser profiles on Github, including cookies, browsing history, etc:


I've also written about some less security-critical things, like shell history (http://corte.si/posts/hacks/github-shhistory), custom aspell dictionaries (http://corte.si/posts/hacks/github-spellingdicts), and seeing if one could come up with ideas for command-line tools by looking at common pipe chains from shell histories (http://corte.si/posts/hacks/github-pipechains).

I've held back on some of the more damaging leaks that are easy to exploit en masse with a tool like this (some are discussed in the linked post, but there are many more), because there's just no way to counteract this effectively without co-operation from Github. I've reported this to Github with concrete suggestions for improving things, but have never received a response.


This works pretty well too, doesn't suffer from Github blocking your script, and is probably even easier.

Github might include something like a warning on your repo when it appears to contain data that you might not want out there.

Github search does many, many things you can't trivially recreate through a Google search:


You can access all of this functionality with ghrabber.

One of my suggestions to Github is that they disable indexing of dotfiles of all persuasions (including contents of dot-directories), unless the repo owner explicitly opts in. That would make it much harder to find a very large fraction of the more obvious leaks.

Which of those functions do you feel would allow you to find vulnerabilities quicker?

Nearly all of them, depending on what exactly you're looking for. Simple things like being able to exclude results from forked repos can save a huge amount of time, and being able to limit results by language, creation date and even number of stars (to find personal repos) has come in useful.
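For anyone unfamiliar with the search syntax, here's a sketch of the kinds of qualifiers meant above. The qualifier names (fork:, language:, created:, stars:) come from Github's documented search syntax; the query itself is a made-up example, not one from the post:

```shell
# Hypothetical search query combining the qualifiers mentioned above:
# fork:false drops forked repos, language: limits by language,
# created: limits by creation date, and stars:<5 tends to surface
# small personal repos rather than popular projects.
q='filename:.bash_history fork:false language:shell created:>2014-01-01 stars:<5'
echo "$q"
```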

I've heard tell that there are people out there who make use of GH's activity feeds to scrape just about every action that's taken on all GH repos.

If true, doesn't this make crippling the usefulness of GH's search really superfluous?

Full disclosure: I'm never a fan of crippling search to cover the ass of someone who has pushed sensitive information to a publicly accessible location. I'm still sore about Google's decision to do things like prevent one from searching for (for instance) credit card numbers. :(

If looking at the common pipe chains from shell histories tells me anything, it is that people are not very familiar with the tools at their disposal.

Just look at some of these chains:

    ps | grep
    cat | grep
    find | grep
    find | xargs
    grep | wc
    ls | grep
    echo | grep
    grep | grep

The legitimate uses for those pipe chains (while they do exist) are few and far between...
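For comparison, a sketch of single-command equivalents for some of those chains (standard grep/find/pgrep options; the sample file here is made up purely for illustration):

```shell
# Create a throwaway sample file to demonstrate on.
printf 'alpha\nbeta\nalphabet\n' > sample.txt

grep alpha sample.txt                        # instead of: cat sample.txt | grep alpha
grep -c alpha sample.txt                     # instead of: grep alpha sample.txt | wc -l
find . -name 'sample.txt' -exec wc -l {} +   # instead of: find . | xargs wc -l
pgrep -f sh >/dev/null || true               # instead of: ps aux | grep sh

rm sample.txt
```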

A particularly odd one on the list was `type | head`. Does anyone know the purpose of this?

The legitimate use comes from a real-world situation. Quite often I first cat a file only to find out it is too large to look through by hand, and only then grep it. As the last command was `cat /some/long/path`, I go up to the last command and just add the grep to the end (and thus end up with cat | grep). Likewise, I vaguely remember that grep has a count switch, but looking it up in the manpage is more work than using wc (-> grep | wc). And likewise, before I would use find's -exec option I would need to look up the precise syntax again, because there were some details regarding character escapes, IIRC.

Remember, we are always intermediates at most things[1].

[1] https://blog.codinghorror.com/defending-perpetual-intermedia...

This isn't "shell golf". This idea that we shouldn't use small focused tools in a chain, but rather we should find as many arcane arguments and switches as we can to shorten the chain, is contraindicated by the Unix philosophy. I don't know why this nitpick comes up so often. There are many "human factor" reasons why a longer chain with simpler commands is desirable.

Is it really so arcane to `grep PATTERN FILE` or `grep PATTERN` <kbd>Alt .</kbd> (if the previous command was `cat FILE`)? Is it also arcane to `pgrep PATTERN` instead of `ps aux | grep PATTERN`? Is it also arcane to `egrep 'PATTERN|PATTERN'` instead of `grep PATTERN | grep PATTERN`? Personally I prefer the "correctness" of this sort of approach, but the tools are just a means to an end and understandably people have varying preferences. Ironically "legitimate" was probably not an accurate choice of words.

> `egrep 'PATTERN|PATTERN'` instead of `grep PATTERN | grep PATTERN`

Oops? Ironically (assuming two distinct values of PATTERN) I think you just answered your own question. (They are different: first is disjunction of patterns, second is conjunction).
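To make the difference concrete, a quick demonstration (the three-line input is made up for illustration):

```shell
printf 'foo\nbar\nfoobar\n' > demo.txt

grep -E 'foo|bar' demo.txt    # disjunction: foo, bar, and foobar all match
grep foo demo.txt | grep bar  # conjunction: only foobar matches

rm demo.txt
```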

Your point has merit for scripts (performance) but for data exploration at the prompt it's almost always irrelevant: the simplicity of pipe composition outweighs anything else.

Whoops, you've got me there. Yes, for that example the grep alternative is not very elegant. Anyway, I wasn't making an argument against composition, just against particular types of composition (such as useless use of cat, parsing ls, or grepping ps) for which there are side-effects, or for which a simpler or more appropriate alternative exists.

I'm familiar with these tools, but I still `| grep`. Once you've learned the "compose little tools" philosophy, it's hard to get it out of your head.
