

Search-Script-Scrape: Web scraping exercises in Python 3 for data journalists - danso
https://github.com/compjour/search-script-scrape/

======
Sven7
If you are a young journalist being told "data journalism" is the must-have
resume bullet point for the future, here's some unsolicited advice from
someone in techland who has worked with journalists.

1\. Don't waste your time on this stuff if you have no interest/aptitude for
it. I see people being pressured into it when it's not the right fit. The kind
of people who will have success with this are the Nate Silvers of the world,
who are really domain experts dabbling in journalism.

2\. Being a journalist gives you access to data and access to experts. Bring
the two together whenever you can. It takes time and skill to develop that
access. And in most cases, it's time better spent than learning Python. Matt
Taibbi is a good example of this. He was able to make sense of something
complex (the 2008 meltdown) by bringing the data and the experts together. No
Python necessary.

~~~
danso
OP here: I don't necessarily disagree with what you've said here. The
"Computational Journalism" class is an elective at Stanford, and while some of
the students are from the journalism program, others come from more technical
fields such as CS or MSE. The programming part for them is not a huge
challenge...but besides the exposure to civic issues and data policy, for some
of them, this is the first time they've worked with things like webscraping
and public-facing APIs (as was the case for me in my computer engineering
degree program, though that was years ago).

So there's a decent sized group of technically-apt students at Stanford who
are interested in journalism. And my advice to them would be to at least
intern as traditional reporters, as there's no better way to learn the work of
developing access and sources (as well as interviewing and writing on
deadline!).

That said, there are opportunities to quickly explore a domain if you're
skilled at data collection and analysis. One of the best examples I can think
of is this writeup by a couple of data reporters about their investigation
into Florida cops:

[http://ire.org/blog/on-the-road/2011/12/20/behind-story-
trac...](http://ire.org/blog/on-the-road/2011/12/20/behind-story-tracking-
police/)

> _This was a case where the government had this wonderful, informative
> dataset and they weren’t using it at all except to compile the information.
> I remember talking to one person at an office and saying: “How could you
> guys not know some of this? In five minutes of (SQL) queries you know
> everything about these officers?” They basically said it wasn't their job.
> That left a huge opportunity for us._

This scenario -- in which the data is freely available but no one thinks to
simply collect it into a spreadsheet -- is just the tip of the iceberg of data
work that needs to be done...but I'd be lying if I said that this kind of low-
hanging fruit was rare...There's plenty of information out there that's just
begging for efficient examination...to paraphrase a classic adage, the problem
today is not that we lack information, but we lack ways of filtering and
understanding it.

I'll leave aside the debate of how worthwhile it is to try to teach
programming to traditional journalists -- it's definitely not easy work...but
there's a great deal of potential in teaching comsci students about civic and
journalistic issues and how specifically to apply their skills. I turned out
OK after first spending a few years as a newspaper reporter, but I think I
missed some opportunities to hit bigger...but back then, I had no concept of
mixing my programming background with my journalism.

~~~
Sven7
Appreciate your answer and what the course is trying to do.

All I'll add is, it's good to be aware of the contradiction all that data
presents. The contradiction shows up in your post and I have a feeling you are
aware of it.

"Quickly exploring a domain" and "efficiently examining the data" are
inherently contradictory. The way to resolve that contradiction (going back to
my previous post) is to (a) get an expert involved as quickly as possible or
(b) become the expert.

And it's healthy for someone starting out (be it a journo dabbling in compsci
or a programmer dabbling in journalism) to keep asking themselves (based on
their aptitude/motivation levels) which road they are taking.

------
gtrubetskoy
Unless I'm missing something, the README doesn't mention that all the examples
rely on "requests" (which is neither in the standard lib nor Python 3 specific,
thus the title is a tad misleading):
[https://pypi.python.org/pypi/requests](https://pypi.python.org/pypi/requests)

~~~
rspeer
I don't get it. What part of the title would imply that it has no dependencies
outside of the standard library?

------
j4kp07
Maybe I am being picky, but is traversing JSON files truly "web scraping"?

~~~
danso
No, you're correct...these exercises were deliberately kept programmatically
simple -- e.g. single loops and conditional statements -- ...not every
student had much CS experience, never mind web scraping. In cases where JSON is
being parsed, it's usually because that's the easiest way to access the
data...but the "skill" in the exercise is recognizing when a website feeds
from such an API...and then going directly to that source.

For example, usajobs.gov is a consumer-friendly jobs search site. You _could_
find the number of librarian jobs by manipulating the web form...or you could
do a little looking around and see that there's an API:

[https://data.usajobs.gov/Rest](https://data.usajobs.gov/Rest)

And just as importantly, there's an official taxonomy for federal jobs:
[https://www.opm.gov/policy-data-oversight/classification-
qua...](https://www.opm.gov/policy-data-oversight/classification-
qualifications/general-schedule-qualification-standards/#url=List-by-
Occupational-Series)

So being able to look at a website and deduce what might be behind it is good
enough...and is actually what I would do in a real-world situation rather than
just trying to reverse engineer a site.

And there's the increasingly common situation in which the website loads data
client-side, such as analytics.usa.gov...and so inspecting the network traffic
and working with the JSON files is the only way to collect the data displayed
on the website.
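
As a rough sketch of that last point: once the network inspector reveals the
JSON behind a page, collecting the data is mostly a loop and a conditional.
The payload and its field names (`jobs`, `title`, `agency`) are made up for
illustration and are not the real usajobs.gov or analytics.usa.gov responses;
in an actual script the text would come from an HTTP request instead of a
string.

```python
import json

# Hypothetical payload standing in for what a site's JSON endpoint might return.
SAMPLE_RESPONSE = """
{
  "jobs": [
    {"title": "Librarian", "agency": "Library of Congress"},
    {"title": "Supervisory Librarian", "agency": "Department of the Army"},
    {"title": "Park Ranger", "agency": "National Park Service"}
  ]
}
"""


def count_jobs_matching(raw_json, keyword):
    """Count job entries whose title contains keyword (case-insensitive)."""
    data = json.loads(raw_json)
    count = 0
    for job in data.get("jobs", []):
        if keyword.lower() in job.get("title", "").lower():
            count += 1
    return count


if __name__ == "__main__":
    print(count_jobs_matching(SAMPLE_RESPONSE, "librarian"))
```

The same function works unchanged whether the JSON was saved from the browser's
network tab or fetched by the script, which is why it's worth checking for such
a feed before writing any HTML parsing.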

~~~
erroneousfunk
One of the most important skills a web scraper can have is being able to take
the easier path and use an API where it's available. APIs, after all, are just
really really nicely-formatted webpages that follow an additional set of
generally agreed on standards. If you were interviewing a web scraper for your
company, and they didn't know how to parse JSON, you'd probably think twice
about giving them an offer. I think it's entirely appropriate -- and necessary
-- to include JSON parsing in the list of exercises. (Also, I recently wrote
"Web Scraping with Python" (O'Reilly), and had a whole CHAPTER on APIs)

------
alexcasalboni
Many of those scripts will most likely fail within a few weeks, as their data
extraction logic is way too simplistic and based on unstable and non-semantic
HTML structures (e.g. doc.cssselect('small a')[0]).
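
A minimal sketch of that kind of breakage, using the standard library's
html.parser rather than the lxml/cssselect calls in the repo. The two HTML
snippets, the `source` class and the helper names are all invented for the
example: the positional lookup roughly mimics `'small a'` plus `[0]`, while the
second lookup keys on an attribute that is assumed to survive a redesign.

```python
from html.parser import HTMLParser

# Hypothetical markup before and after a site redesign.
OLD_PAGE = '<p><small><a class="source" href="/data.csv">Source</a></small></p>'
NEW_PAGE = (
    '<p><small><a href="/terms">Terms</a></small> '
    '<small><a class="source" href="/data.csv">Source</a></small></p>'
)


class LinkCollector(HTMLParser):
    """Record every <a> tag along with whether it sits inside a <small>."""

    def __init__(self):
        super().__init__()
        self.small_depth = 0
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "small":
            self.small_depth += 1
        elif tag == "a":
            attr_map = dict(attrs)
            self.links.append({
                "href": attr_map.get("href"),
                "classes": (attr_map.get("class") or "").split(),
                "in_small": self.small_depth > 0,
            })

    def handle_endtag(self, tag):
        if tag == "small" and self.small_depth > 0:
            self.small_depth -= 1


def collect_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def first_small_link(html):
    """Positional lookup: the first <a> found inside any <small>."""
    links = [link for link in collect_links(html) if link["in_small"]]
    return links[0]["href"] if links else None


def link_by_class(html, css_class):
    """Attribute lookup: the first <a> carrying the given class."""
    for link in collect_links(html):
        if css_class in link["classes"]:
            return link["href"]
    return None


if __name__ == "__main__":
    print(first_small_link(OLD_PAGE), first_small_link(NEW_PAGE))
    print(link_by_class(OLD_PAGE, "source"), link_by_class(NEW_PAGE, "source"))
```

Neither approach is immune to a redesign, but the positional one silently
returns the wrong link as soon as another `<small>` link is added ahead of it.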

~~~
simonw
That's just the nature of web scraping.

------
doug1001
For aspiring journalists, I should think a class like this is a godsend: those
who put the work in will have, at the end of the semester, a potent set of
tools for specific data gathering (e.g., which California city manager earned
the most last year?). Each student forks this repo and builds their own web
crawling toolbox. Kudos to the professor for conceiving this course and for
teaching it.

------
thuruv
The others might have failed to understand that these are tools, not the
talents needed to pursue their career.

