The New York Times course to teach its reporters data skills is now open-source (nytimes.com)
427 points by espeed 10 months ago | 82 comments

I found it both amusing and annoying that the first line of the article mentions VLOOKUP.

In my experience, there's never a good reason to use VLOOKUP. You can always achieve the same functionality using INDEX (in conjunction with MATCH). Using VLOOKUP means that your formulae break as soon as someone inserts a new column in the middle of your table. And clicking the cell with the formula doesn't immediately show you which two columns are used by the function.

More (opinionated) Excel tips here: https://www.encona.com/posts/excel-best-practices

That is unless you parametrise that column index as well - you can do a MATCH to retrieve your desired column number and safeguard it against column insertions/deletions. (It weakens the case for VLOOKUP, but being quite a light Excel user, this is usually how I do things.)
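To make the fragility concrete, here is a minimal Python sketch (not Excel itself; the table, column names, and helper functions are all made up for illustration). A hard-coded column position plays the role of VLOOKUP's column index, and resolving the column by header name plays the role of MATCH.

```python
# Hypothetical table: first row is the header, first column is the lookup key.
table = [
    ["name", "dept", "salary"],
    ["alice", "news", 70000],
    ["bob", "sports", 65000],
]


def lookup_by_position(rows, key, col_index):
    """VLOOKUP-style: the result column is a hard-coded 1-based position."""
    for row in rows[1:]:
        if row[0] == key:
            return row[col_index - 1]
    raise KeyError(key)


def lookup_by_header(rows, key, key_col, value_col):
    """INDEX/MATCH-style: both columns are found by header name at call time."""
    header = rows[0]
    key_pos = header.index(key_col)      # like MATCH on the header row
    value_pos = header.index(value_col)  # like MATCH on the header row
    for row in rows[1:]:
        if row[key_pos] == key:
            return row[value_pos]        # like INDEX into the matched row
    raise KeyError(key)


def insert_column(rows, position, name, fill):
    """Return a copy of the table with a new column inserted at `position`."""
    header, *body = rows
    new_header = header[:position] + [name] + header[position:]
    new_body = [row[:position] + [fill] + row[position:] for row in body]
    return [new_header] + new_body


# Someone inserts a column in the middle of the table.
wider = insert_column(table, 1, "desk", "n/a")

before = lookup_by_position(table, "bob", 3)                     # the salary column
after_position = lookup_by_position(wider, "bob", 3)             # now the dept column
after_header = lookup_by_header(wider, "bob", "name", "salary")  # still salary
```

The positional lookup silently returns a value from the wrong column after the insertion, while the header-based lookup keeps pointing at the salary column.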

So you just decided you were gonna find something to criticize before you even clicked the link, eh?

Yup. Without clicking the link I correctly guessed that the first sentence mentioned VLOOKUP


I know Hacker News dislikes spreadsheets as opposed to statistical programming languages like R and Python, and there have been many startups trying to disrupt that paradigm, but time and time again spreadsheets have proven to be the most reliable tools for data analysis and for collaborating with nontechnical users.

Even as a data scientist working with other data scientists, my most common deliverable is a Google Sheet.

"All enterprise software competes with Excel. All productivity software competes with emailing things to yourself."


R and Python will not replace Excel until they "correctly" transform gene names into dates or floating point numbers...


How can we have reproducibility in scientific research if we don't all make the same errors?

Don't blame excel for people being ignorant on number formatting. It's front and center in the ribbon menu.
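For contrast with the gene-name coercion joked about above, a minimal Python sketch: the standard `csv` module hands every field back as text and converts nothing unless asked. The sample data is made up for illustration (gene symbols that look like dates, IDs that look like scientific notation).

```python
import csv
import io

# Made-up sample: symbols that resemble dates, IDs that resemble floats.
raw = "gene,clone_id\nMARCH1,2310009E13\nSEPT2,1110001E05\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Every field comes back as a plain string; nothing is reinterpreted.
genes = [row["gene"] for row in rows]
clone_ids = [row["clone_id"] for row in rows]

# A conversion only happens when it is requested explicitly.
as_float = float(clone_ids[0])
```

The trade-off is that the type of each column is the reader's decision, which is more work up front but leaves nothing to a formatting default.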

All scheduling software competes with a pen and a piece of paper?

According to the Wall Street Journal, yes.

The Trendy New Way to Organize Your Schedule: A Paper Planner https://apple.news/AMN1W-MtTRCecQ-qIeFUQSQ

And a corkboard, yes. Or maybe a big wall calendar.

The old clocking-in systems, with a card stamped by the clock, were efficient, and everyone understood how they worked.

That's minimal time tracking, not scheduling. It's the difference between retrospective and prospective.

Kanban systems with real physical cards work pretty well. And they are way more resilient than digital stuff.

Good point.

Scheduling is a subset of productivity, and, IME, not an exception to the rule about productivity software and Excel.

Or a notebook.

Hah, I dogfood http://handlr.sapico.me so I don't have to email things to myself.

It seems to work :p

Note, sapico.me (without www) is a bare Windows Server welcome page. Could redirect it to www so Google search bot isn’t confused about what’s at sapico.me.


Well, my start page isn't even mobile friendly. I don't actually bring in clients through the web page, so I don't really need it.

It's just there; a refactored version (Jekyll) is already in the works.

I don't dislike people using the statistical tools available to them, but in my own field (social sciences) there's a huge replication crisis going on right now. And a lot of that is due to people who were never good at math taking easy-to-use statistical tools like Excel and SPSS and blindly running stats without programming or math training.

Is it too much to ask that people treat a field with a bit of respect? Like, just because NYT reporters can use some of these "data skills", can they hold off a bit until we figure out if they're even any good at statistical analysis after their crash course? We currently have an entire academic field that has to throw away a lot of their findings because tools like sheets and SPSS gave them false confidence. I don't have any higher hopes for the NYT newsroom.

I think that’s unfairly dismissive of the data scientists, statisticians, analysts and engineers who work at the New York Times and other major publications (as well as smaller, but crucially important places like ProPublica).

The purpose of this material isn’t to suddenly turn normal reporters into data scientists; it’s to give them a better grasp of how to evaluate the different types of information that become important when reporting.

I don’t know how good or bad this material is — a cursory glance shows that it’s very low-level, the type of stuff I learned in my 100 level accounting and stats classes as an undergrad. But I won’t dismiss this material being made available and potentially augmented for all — tho I wish it was stored in GitHub or GitLab.

If you look through the material, there is nothing that actually says that someone who goes through this training will be a skilled data journalist. But it might just prevent poorly-interpreted articles like this [1] from being written.

And for the record, I’ve worked with data journalists who were more skilled in math and computer science than the engineers I work with at giant tech companies.

[1]: https://www.nytimes.com/2019/06/09/business/media/google-new...

I don't think the folks here dislike spreadsheets for working with data. I think the opposition is against spreadsheets being used to make mission critical but very brittle applications.

Also anything is better than SPSS.

In my consulting career, I've seen major organizations track data in Excel, updating each month by creating a new set of columns. An absolute nightmare to handle. Of course every month the structure changed slightly, to make it more fun. I've also seen values stored as an #rgb only - no actual value in a cell.

Nothing wrong with analyzing data in Excel. Nothing wrong with running limited operations in Excel if it is done by the same people and highly standardized.

Tracking major operations in Excel, mostly outside the existing ERP (which in my opinion happens more often than not), is a big no-no.

Side note: I know people pasting screenshots from other Excel sheets into Excel sheets....

To add to that, I have seen billions of dollars in EMD traded off an Excel model that took 30 minutes to run: come in to work, hit F9 to recalc and pull in updated market data, go get coffee, come back, send trades to the trading desk.

Agreed, and as an entry level program I think jamovi leads the pack https://www.jamovi.org/

That looks awesome. Thanks for sharing. I never use SPSS myself but occasionally have to look at other people's work in the app and I loathe it.

SPSS and its native format, .sav, are the industry standard for survey data. If anyone is in that field and hasn’t used Q (Q Research Software), I highly recommend it. I used to do a lot of crosstabs and banner reports in SPSS and WinCross, and Q saved me probably 15 hours of work in that process per survey.

Excel is an excellent tool, the most democratic data tool out there (and for the foreseeable short term future).

Whether you are a finance person working on balancing the sheets (or whatever else they do, that's beyond my knowledge) or an operation person building complex macros, it is both versatile, easy to use and yet powerful.

A byproduct of this is that it is also subverted by human ingenuity, and you end up with some pretty interesting, awe-inspiring contraptions.

Such as building a calendar system but using arrow graphs to fill it out. And that's fine, except when you try to scale things up to automate the process and save people money. Now you have to do some pretty wasteful engineering to accommodate this pesky creativity we have.

That is really interesting, and a testament to it being really good software (in features and reach). It breaks the boundaries of the limited vocabulary of computers and therefore can only be fully leveraged by humans.

It's both a great reminder that most people in the world are not on the same level as the crowd around here [1] and that as the group who create tools used by such an audience, we have to be mindful of that.

On a positive note, I think there is some interesting work being done to rethink how to approach those tasks. I remember using one product in particular that had the right mix of being visually engaging while enforcing boundaries. But this won't solve the issue of users having to hack their way into getting what they want how they want it.

[1] https://www.nngroup.com/articles/computer-skill-levels/

For the Excel pros here: how do you ever template anything in Excel? Like if you want the same formula for two different tables, do you just copy-paste? How do you keep them in sync? There's just so many things I want to do in Excel and it seems so limited in what it can do that I can't fathom how non-programmers end up using it so successfully.

I've seen locked sheets used as data source for constants and formulas with variables, the cells reference those with lookups and variables.

I've come across a few spreadsheets that made me want to meet and learn something from the authors. Some have been very elaborate and surprisingly resilient.

You can look up formulas? Never realized that, that's really cool. Thanks!

I have seen formulas entered as plain text in a cell, and then a macro used to update other cells with that formula. You can then change the formula in one place and run the macro to update it everywhere.

The best way to create reusable formulas in your workbooks is to add your table to the data model (which creates an in-memory tabular cube) and write measures using the DAX formula language. This has the added benefit that a single formula can be written to aggregate data at different levels; for example, a sum can be calculated over days, months, or years. This will only allow you to share formulas within a single workbook. The data model in Excel is powerful and underutilized.
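As a rough analogy only (this is Python, not DAX, and every name and number below is made up for illustration), the idea of one measure evaluated at several levels of aggregation can be sketched like this:

```python
from collections import defaultdict
from datetime import date

# Made-up fact table: (date, amount).
sales = [
    (date(2019, 1, 5), 100),
    (date(2019, 1, 20), 50),
    (date(2019, 2, 3), 70),
    (date(2020, 1, 9), 30),
]


def total_amount(rows):
    """The 'measure': defined once, with no knowledge of how rows are grouped."""
    return sum(amount for _, amount in rows)


def evaluate(measure, rows, grain):
    """Group rows by the requested grain, then apply the same measure to each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[grain(row[0])].append(row)
    return {key: measure(group) for key, group in groups.items()}


# The same measure, evaluated at two different levels.
by_month = evaluate(total_amount, sales, lambda d: (d.year, d.month))
by_year = evaluate(total_amount, sales, lambda d: d.year)
```

The point of the sketch is that the aggregation logic lives in one place, so changing `total_amount` changes every report that uses it.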

Wow I'd never heard of DAX. This sounds awesome. Thank you!

Most Excel users would copy paste a formula even _within_ the same table, and don't know that Excel's built-in 'Tables' allow you to automatically keep a formula consistent between rows.

I'm not sure whether there's a good answer to your question. If there is, I'm sure that most 'Excel pros' don't know it.

I've seen folks in excel say "I'm not a programmer, so I did this in excel." and ... I was pretty damn impressed. Some of them had some of the logical thinking basics down to be pretty good at programming, if they gave it a try I suspect some might be damn good.

No one's ever told them that they ARE programming. Just in Excel.

I’m not really very facile with excel, but one place it’s really handy for me is situations that combine manual data entry with computation.

A super simple example is keeping track of monthly bills when you have flat mates, and need to split them every month. It’s not so critical that I feel the need to version control it etc, but it’s still nice to have a visually inspectable record. Even though I spend most of my days writing code, anything I can dream up that involves python or whatever just seems unnecessarily opaque and baroque. A spreadsheet is ideally suited to the task.
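For comparison, the computation itself is small in Python; what a script lacks is the visually inspectable record that the spreadsheet gives for free. A minimal sketch, with made-up names and amounts (in cents, to avoid floating-point rounding):

```python
def split_bill(total_cents, people):
    """Split a bill evenly, handing out any leftover cents one at a time."""
    share, remainder = divmod(total_cents, len(people))
    return {
        person: share + (1 if i < remainder else 0)
        for i, person in enumerate(people)
    }


# Hypothetical flatmates and one month of bills, in cents.
flatmates = ["ana", "ben", "cho"]
bills = {"rent": 180000, "internet": 5000, "power": 7301}

# Accumulate what each person owes across all bills.
owed = {person: 0 for person in flatmates}
for name, cents in bills.items():
    for person, part in split_bill(cents, flatmates).items():
        owed[person] += part
```

Nothing here shows the intermediate rows at a glance the way a sheet does, which is arguably the commenter's point.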

You should always use higher-level software (Excel) until you can't.

I don't know who dislikes spreadsheets. I am a programmer by trade, and I regularly use spreadsheets to accomplish dozens of tasks, from managing the household budget to making semi-automated expense reports to keeping scores in a local gaming club. Spreadsheets are an excellent tool if used properly, and the more you learn about them, the more uses they reveal.

Until you see motor simulations running in Excel which break in many ways.

If your data can be manually entered, it's decent, or if you just need to filter a CSV by some columns. But for anything beyond that, please choose a proper programming language and data format.

So you're saying you should use appropriate tools for the job? Fascinating idea!

It's more work to verify all formulas that reference unnamed variables in a spreadsheet than to review the code inputs and outputs in a notebook.

"Teaching Pandas and Jupyter to Northwestern journalism students" [in DC] https://www.californiacivicdata.org/2017/06/07/dc-python-not...

> http://www.firstpythonnotebook.org/

You can also develop d3.js visualizations — just like NYT — with jupyter notebooks and whichever language(s).

"Data-Driven Journalism" ("ddj") https://en.wikipedia.org/wiki/Data-driven_journalism


"The Data Journalism Handbook 1" https://datajournalism.com/read/handbook/one

"The Data Journalism Handbook 2" https://datajournalism.com/read/handbook/two

While there are a number of ScholarlyArticle journals that can publish notebooks, I'm not aware of any newspapers that are prepared to publish notebooks as NewsArticles. It's pretty easy to `jupyter nbconvert --to html` and `--to markdown`, or just 'Save as'.

Regarding expressing facts as verifiable claims with structured data in HTML and/or blockchains: "Fact Checks" https://news.ycombinator.com/item?id=15529140

Does this course recommend linking to every source dataset and/or including full citations (with DOI) in the article? Does this course recommend getting a free DOI for the published revision of an e.g. GitHub project repository (containing data, and notebooks and/or the article text) with Zenodo?

Can somebody please explain what "open source" means here? I could only find a bunch of files on a google drive, none of which have a licensing note?

"Open source" is what this sub-blogspam level article chose to call it. The NYT called it "Releasing our materials".


Are they using the term open source correctly here? It's already muddled between open source, open core, and Free software ( RMS approved ).

Couldn't they just say the course materials are now available?

Yeah I didn't see any mention of licensing. Is this CC BY-NC? CC0? Your guess is my guess.

Context: I work in the OER space and I'm interested in the material, but can only use it if there is an explicit open license attached to it.

I feel that courses co-opting the term "open source" is just unnecessary. Seems like a PR gimmick.

The NYT never calls it "open source". That is an invention of this article.

Attempts to get journalists more up-to-speed with this sort of stuff are to be applauded.

But the real problem is journalists (and their audiences) who, for a lack of professional ethics, don't give a crap about which parts of their stories can/cannot be backed up quantitatively. Besides selling newspapers, not giving a crap also has the great benefit that now they don't have to learn math, or programming, or logical thinking, or any of that.

To be clear, these are supporting course materials (syllabus, data sets, cheat sheets, etc.), not actual instruction materials. I'd love to take the course.

"... keeping track of the 3,472,382 people currently running for the Democratic nomination for president."

Is that a joke?



Any reason this links to a non-original source, with one of those annoying full screen email grab modals, when the article contains a link to the original release [0] ?

[0] https://open.nytimes.com/how-we-helped-our-reporters-learn-t...

Thanks for keeping the link and posting here!

Maybe because medium.com decided I've read too many articles this month. I currently have a couple of digital subscriptions (man, they are adding up quite fast), and I'm not going to start paying for Medium too.

Off topic maybe: are there resources out there for teaching coders how to journalism?

Having done both journalism and cs in undergrad and grad, respectively, I'd say the former is more nuanced.

Both are relatively easy to dabble in, and both are relatively hard to master.

Things like the inverted pyramid, sourcing and neutral voice will get you fairly far in terms of basic information relaying, but great journalists are specifically skilled at interviewing, data diving and other things tertiary to pen on paper.

In retrospect, I was never suited for journalism. I can write well, but that's not really a great (traditional) journalism skill. The finished product for hard news is fairly bland and paint by numbers. Anything else treads into entertainment-journalism and the kinds of things that have a bunch of people screaming "fake news." I call that Race-To-The-Bottom Journalism and it's very in vogue these days.

Consider working with journalists (evolve the pair-programming model [1] into something new ~> pair-journalism)...

"AI-enhanced journalism offers a glimpse of the next knowledge economy" https://www.niemanlab.org/2019/06/droidward-and-botstein-can...

[1] https://en.wikipedia.org/wiki/Pair_programming

I have less than great communications skills but have had great experience pairing with people who have good skills. Eg to give corporate seminars we both go onstage and the partner will structure the talk and interrupt me/clarify if I blur over something/watch for the audience’s reaction.

As a parallel, maybe journalists should pair with subject matter people (even subject matter generalists) rather than have them as “sources”. There are of course people who are great at both things (your Nate Silvers) but the whole process might be cheaper and more efficient if, gee, the New York Times would have a couple onstaff PhD economists (not star columnists) that sit in the newsroom trying to give shape to the facts about to be reported, side by side with journalists.

There are some courses on Poynter. Some of the best material I've seen was '50 Writing Tools'. It was a series of blog posts that are no longer online, but you'll have no trouble finding it on the web.

Nice tip! I found this:


And it links to a book on the topic too.

> Roy Peter Clark... whittled down almost thirty years of experience in journalism, writing, and teaching into a series of fifty short essays on different aspects of writing

Just imho: I’ve found journalism awkward to teach because at its heart, it’s all very much just skills you use as an everyday human, with an extra emphasis on being unafraid to ask questions. My high school journalism teacher told me that if I wanted to learn journalism, the best way was to just do it. And not to go to j-school but to just work for the college paper. And I think that still applies to any working adults today. Besides doing it, being an avid news consumer is great preparation.

(disclosure: I am a former j-school professor)

As a former journalist, yes and no. When I first started, my company treated me to a one-week internal course that was very good. As I recall, there was a day on story structure: news leads vs. features, pyramid, delayed drop, etc. A day on the basics of law for journalism: libel and the like. A day on interviewing and keeping transcripts. A day on ethics: on the record, off the record, non-attributable. There was also a day on what subs do: headline writing and the like. All in all, a good, very useful grounding.

This feels strange to me. These are all things I learned in J-school, and instead of a day on each of these topics, we got an entire semester.

Were you doing journalism for a web site?

/Degrees in journalism and communications (not to be confused with a communication degree)

No. This was in the mid to early 1980s in the UK, for a large B2B computer magazine publisher (VNU). Journalism school really wasn't 'a thing' to nearly the same degree then. Instead you were trained on the job. My degree was biology, but I had worked a lot on the student newspaper.

I don't think this kind of induction training was typical - it was just put on by that particular company at that particular period. But it was very good. Shout out to the trainer who I still remember 30 years later. You were excellent Tim https://www.linkedin.com/in/tim-ring-33b7233

Yeah I interned and worked at 2 large regional papers and I never got trained in writing or story structure, just on how to use the CMS, LexisNexis, and occasionally the in-house lawyer would come in to do off-the-record q&a’s about legal issues

Yes, there are journalism classes and schools. There is nothing specific toward programmers for this.

Outside of personal blogging, in what context would a programmer need to know the fundamentals of journalism?

The skills of a good journalist — interviewing, elicitation, focus on critical details, and the ability to build a compelling story — are skills that /anyone/ should find valuable. But a developer or architect responsible for product, process, or service design should find it particularly useful when interviewing users and stakeholders and determining functional and experiential design.

When running a startup, and you either want the press to talk to you, or the press decides that you're an interesting story.

Larger companies have PR people to handle that kind of thing. In a startup, though, it could easily be a programmer who talks to the press, because there isn't anyone else.

There are courses on how to speak to the media. They are aimed at CEOs, politicians etc, or anyone who might need to speak to the media. Often taught by reporters, or ex-reporters.

I’ve done one, I’d recommend it. I have rarely talked to journalists, but the skills are useful in any situation where you’re being ‘grilled’, like a job interview.

That's cool. Where are these kinds of courses generally taught?

I’d google for something like “media training for executives”. You can probably get courses in most major cities.

I see, thanks!

Communication skills would be a prerequisite for journalist skills, and I see the front page of Hacker News filled with discussions on communication skills every other day.

Small things like putting the executive summary on top, starting every speech/article/email with a hook that tells the user why they care, etc. If a bunch of coders could master those skills, and combine it with data analysis and web design skills, then they could become a force that can compete with the mainstream media.

Time for me to learn jQuery and create a startup.

Why would journalists need to know data science?

It's not symmetrical like that. Journalists report on data based phenomena all the time, so understanding statistics is a fairly useful skill.

Programmers don't typically have to do journalism. They do have to write, but writing =/= journalism. If we're talking about writing classes for programmers, personally I don't think there's anything special about the programmer use case, as opposed to "writing in professional contexts" in general.

That seems like more of an exercise in framing.

If you currently work at the NYT as a data-science researcher, doing the job that journalists without data-science experience can’t do, and you see your job imperilled by this, you’ll want to do exactly what these journalists are doing, but in reverse: Expanding their role to something they didn’t previously do in order to stay competitive in the job market.
