
Using wget and grep to explore inconveniently organized federal data - danso
https://gist.github.com/dannguyen/26e5922614dc22053745
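The gist's core pattern is a recursive wget mirror followed by grep over the local copy. A minimal sketch of that pattern, with a hypothetical URL and file names (the wget line is illustrative and commented out; the grep step runs against stand-in files):

```shell
# Mirror a directory of data files (hypothetical URL, shown for illustration):
# wget --recursive --no-parent --accept 'txt,pdf' https://example.gov/reports/

# Stand-in for a downloaded mirror, so the search step below is concrete:
mkdir -p mirror/example.gov/reports
printf 'FY2015 appropriations summary\n' > mirror/example.gov/reports/a.txt
printf 'unrelated press release\n'       > mirror/example.gov/reports/b.txt

# Case-insensitive recursive search; -l lists only matching file names.
grep -r -i -l 'appropriations' mirror
```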
======
baldfat
[http://openrefine.org](http://openrefine.org)

If you are not going to learn AWK, R, Python, or Julia, the next best thing is
OpenRefine.

> OpenRefine (formerly Google Refine) is a powerful tool for working with
> messy data: cleaning it; transforming it from one format into another;
> extending it with web services; and linking it to databases like Freebase.
> Please note that since October 2nd, 2012, Google is not actively supporting
> this project, which has now been rebranded to OpenRefine. Project
> development, documentation and promotion is now fully supported by
> volunteers. Find out more about the history of OpenRefine and how you can
> help the community.

~~~
kazinator
TXR: [http://nongnu.org/txr](http://nongnu.org/txr)

------
conductor
This reminds me of Manning's case.

_"That Manning was convicted of computer fraud seems to suggest that using
wget on a U.S. government computer to download large numbers of files can be
considered the digital equivalent of trespassing -- even if it's on turf you're
otherwise allowed to access."_

[http://www.washingtonpost.com/news/worldviews/wp/2013/07/30/...](http://www.washingtonpost.com/news/worldviews/wp/2013/07/30/the-free-web-program-that-got-bradley-manning-convicted-of-computer-fraud/)

~~~
morninj
Well, sort of. In the Manning case, the government argued that wget was
unauthorized on government-owned client machines. Their argument in that case
wouldn't apply to the public's use of wget to send requests to government-
owned servers.

~~~
samstave
Then we need a scan of all government machines to see whether wget was
installed across many thousands of them, which would suggest this was not an
individual breach of policy and show that the government was remiss in its
ability to enforce the policy at all.

------
adricnet
ProPublica has some great material on techniques for data journalism,
including scraping and transforming unfriendly data formatting:

[https://www.propublica.org/nerds](https://www.propublica.org/nerds)

~~~
baldfat
The URL is nerds??? I looked at the page and there was nothing there that
shows it as an acronym either, just the title of the blog. Not sure I like
that.

~~~
minikites
I'm skeptical of a complaint like this from someone who voluntarily chose the
handle "baldfat".

~~~
baldfat
Bald and fat is what I am: 6' 1", 230 pounds, and I've been bald like Captain
Picard since I was 25.

"Nerd" has a negative connotation as opposed to "geek." I know it seems weird,
but being called a nerd was very negative before geek/nerd was cool.

Definition: 1. a stupid, irritating, ineffectual, or unattractive person. 2.
an intelligent but single-minded person obsessed with a nonsocial hobby or
pursuit: a computer nerd. [1]

[1] [http://dictionary.reference.com/browse/nerd?r=75&src=ref&ch=...](http://dictionary.reference.com/browse/nerd?r=75&src=ref&ch=dic)

------
csense
One interesting thing the author mentions -- interesting despite being
somewhat irrelevant to the topic -- is the astonishing fact that the vast
majority of people don't know how to use Ctrl+F.

I submitted it here:
[https://news.ycombinator.com/item?id=10321439](https://news.ycombinator.com/item?id=10321439)

------
ape4
It's so sad that there isn't a site with everything in plain, sensible
formats. data.gov seems like a good start, but a plain-text UI (like gopher or
FTP) seems better: a subfolder for each state, and within that a subfolder for
each department. Maybe something like Linux's /proc, but for laws and federal
data.

------
spectre256
This feels like a data set that might already be on the amazing
[https://enigma.io](https://enigma.io)

~~~
otoburb
Enigma's secure URL isn't working. The working address is
[http://enigma.io](http://enigma.io)

~~~
spectre256
That's disappointing. I actually didn't even check that they used HTTPS, I
figured it was safe to assume they do. Guess not.

------
melling
Does anyone know of a quick trick to extract the English text from HTML files?

~~~
dmckeon
Try:

    lynx -nolist -dump

-dump dumps the formatted output of the default document, or of documents
specified on the command line, to standard output.

-nolist disables the link list feature in dumps.

See also -crawl and -traversal.
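If lynx isn't available, a rough Python sketch using only the standard library's html.parser can pull the visible text out of an HTML file. It is far cruder than lynx's rendering, and the class and function names here are just illustrative:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of script/style tags."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside a script/style element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text outside of skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)


# Example: html_to_text("<p>Hello <b>world</b></p>")
```

Unlike lynx, this does no layout (no word wrap, no table formatting), so it is best suited to quick grep-style filtering rather than readable output.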

