Hacker News
In Case You Wondered, a Real Human Wrote This Column (nytimes.com)
129 points by l_adams on Sept 11, 2011 | 63 comments



What this article doesn't tell you is that the human who wrote it works for Narrative Science.

For those who don't know, if you see a story in your local paper, and it doesn't involve a car crash, crime, weather, or sports, it was probably placed there by a PR representative. Most of the things you read are not the result of random reporters deciding to cover X or Y, but a paid, concerted effort to place story X or Y in the paper by providing the paper with a fully pre-digested story to perhaps rewrite, or perhaps not.

The words "narrative science" appear 14 times in that story, including such clunkers as "To generate story “angles,” explains Mr. Hammond of Narrative Science...." when Mr. Hammond has already been introduced earlier in the story. It even includes pricing: hey readers, this is not only cool and will win the Pulitzer Prize, but it's cheap too! No mention of competitors... It reads like an ad because it is an ad.

This story was provided, probably almost word for word, by a PR person to the NYT reporter.

I'm not sure if computer-generated text will be better or worse than the media system we have now.


I'm not defending the story as a masterpiece of journalism or anything, and I'm sure it was pitched to the reporter by an interested party. But you shouldn't blame 'such clunkers as "To generate story 'angles,' explains Mr. Hammond of Narrative Science...." when Mr. Hammond has already been introduced earlier in the story' on PR agencies. They don't actually write the copy that runs in the NY Times. That particular clunker was probably imposed on the writer by the copy editor ("It's been a while since Hammond was referenced; we need to remind the reader who he is"), unless the author has internalized the style himself.

This story was provided, probably almost word for word, by a PR person to the NYT reporter.

Definitely not. The idea for the story was provided to the reporter, probably by a P.R. person. The reporter conducted interviews with representatives of the company, some of whom are quoted in the story and some of whom aren't. The reporter went away and wrote up the story himself. It was edited by at least one line editor and at least one copy editor. The reporter, line editor, and copy editor have all beaten many competitors to obtain jobs at the most prestigious company in their field.

If anyone at the New York Times were found to have submitted a story that was "provided ... almost word for word by a PR person" that person would be fired and the paper would issue a public apology.

Again, I'm not defending the story, and I'm sure the PR person who pitched it was thrilled by it. But, you know, Steve Lohr's byline is on this story, and you've accused him of pretty bad professional misconduct, and I don't think that's warranted.


Markoff once ran an article about MyWeb (Yahoo's competitor to Delicious) that basically took everything from PR instead of actually doing any research. No mention of competitors, letting Yahoo take credit for stuff other people invented, etc.

It happens more than you think.


That doesn't surprise me at all. What would surprise me would be if he copied the actual text provided for him by a PR firm, as the OP suggested Lohr had done.

Like every other daily newspaper, the Times produces some rushed, lazy journalism (as well as some very good journalism). But there are certain lines they don't typically cross, and literally taking dictation from PR is one of them. You might argue that the difference isn't meaningful, and that rules like "Don't just copy out someone else's text" are a fig-leaf to hide bigger problems. I'd have a lot of sympathy for that argument. But if we're going to criticize the Times we should criticize them accurately, for the things they're actually doing wrong, rather than accusing them of doing things they haven't done.


Exactly right. Quick bit of advice for startups here:

Last fall, the Big Ten Network began using Narrative Science for updates of football and basketball games. Those reports helped drive a surge in referrals to the Web site from Google’s search algorithm, which highly ranks new content on popular subjects, Mr. Calderon says. The network’s Web traffic for football games last season was 40 percent higher than in 2009.

This planted PR piece uses a bit of customer data to demonstrate that the technology makes great business sense. So far so good. This particular piece of data, though, will be read as "Markov chains creating content to spam up the search results" by folks inside the Googleplex, and it makes them look stupid. Don't make Google look stupid. If you do make Google look stupid, don't brag about it in the NYT. It will not end well.


It only makes Google look stupid if it's bad content that gets ranked highly. I've taken several classes with Birnbaum and Hammond, and the goal isn't to make bad content to fool search engines; it's to make content so good it seems as if a human wrote it. If that's actually the case, there's no conflict with Google -- good content gets pushed to the top.


it's to make content so good it seems as if a human wrote it. If that's actually the case, there's no conflict with Google -- good content gets pushed to the top.

There's an easy way to achieve that: have an actual human write it. This solution does not necessarily win one goodwill with Google: for one obvious example, most of the content farms were farming manually rather than farming with Markov chains.

I also think "good content" glosses over the gap between a) what actually ranks on Google and b) what, in an ideal world, Google thinks should rank on Google.


Maybe they are so good that Google can't tell their stories from human-made ones. Although I wouldn't bet on it.


Google is already working to attribute content to authors; think of it as PersonRank. With verified real people (they are getting many signals for this, such as Android, Chrome, Checkout, Places, Voice, Plus, etc.), they may be able to combat spam like this by ranking news/stories based on the history of the author.


A single article, probably not. If you examine a site producing these in high volume, though, you would probably notice a high rate of similar words and constructs. Even just the speed at which certain things are posted is a tell: if a site is consistently the first source after something like a sporting event, it is either doing a lot of pre-writing or the stories are generated.
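The detection idea above can be sketched roughly as n-gram overlap between articles: recaps stamped out from the same template share an unusually large fraction of word sequences. The sample recaps here are invented for illustration.

```python
# Rough sketch: articles generated from the same template share many word
# n-grams, so pairwise overlap across a site's archive flags generation.
def ngrams(text, n=4):
    """Set of word n-grams in a text (case-folded)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(a, b, n=4):
    """Jaccard similarity of the two articles' word n-grams."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

# Two invented recaps that a template might have produced:
recap1 = "The Wildcats defeated the Tigers 27 to 14 in a hard fought game on Saturday"
recap2 = "The Bears defeated the Lions 31 to 10 in a hard fought game on Sunday"
print(overlap(recap1, recap2))  # noticeably above zero despite different facts
```

Two human-written accounts of different games would share almost no 4-grams; templated ones keep scoring above a threshold.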



Ha. That article led to my first correspondence with PG. Scroll down to the bottom for the spoiler.


This is true. I've been told by PR people that 80% of the content in a newspaper or a magazine is written by PR agencies.

I have a question, though. Why don't they also create an expert system that writes PR pieces? I imagine it would be a really nice tool for people in the business. You write down the variables (who, what, when) from an interview with a client in a form, and you get to show the client an instant first draft; you correct it together and publish it after leaving the room.

Or even better yet, take that data and use advanced algorithms to embed mentions of clients, products and initiatives in bigger articles.

This is how a PR piece usually goes:

1. Case, problem, solution. Short story here.

2. Introduction of company

3. A Question and an Answer

4. Another look at the company, credentials.

5. "This is really cool, think about it." There may be a sound (text) bite here.

Do you see the patterns?
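Yes, and the five-part pattern above can be sketched as a fill-in-the-blanks template. Every field name and all template text here are hypothetical, just to illustrate the who/what/when form the comment describes.

```python
# Toy sketch of the "expert system that writes PR pieces" idea.
# The template mirrors the five-part structure listed above.
PR_TEMPLATE = (
    "{who} faced {problem}. {solution} solved it.\n\n"          # 1. case/problem/solution
    "{who} is {company_blurb}\n\n"                              # 2. introduction
    "Q: Why does this matter?\nA: {answer}\n\n"                 # 3. a question and an answer
    "{who} has {credentials}.\n\n"                              # 4. another look, credentials
    '"{sound_bite}"'                                            # 5. the sound bite
)

def draft_pr_piece(fields):
    """Fill the who/what/when form from a client interview into a first draft."""
    return PR_TEMPLATE.format(**fields)

draft = draft_pr_piece({
    "who": "Acme Corp",                       # all values invented
    "problem": "a flood of unstructured sales data",
    "solution": "Its new dashboard product",
    "company_blurb": "a ten-person startup based in Austin.",
    "answer": "Small firms can now see trends the big players miss.",
    "credentials": "three patents pending and 40 paying customers",
    "sound_bite": "This is really cool. Think about it.",
})
print(draft)
```

The instant first draft the comment imagines is literally one `format()` call; everything hard is in choosing the fields and the phrasing variants.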


Why don't they also create an expert system that writes PR pieces?

It's cheaper and easier to just use the endless supply of interns.


Unpaid interns? Ouch. I didn't think of that.

Although there are hidden costs even behind unpaid labor. Office space, HR costs (hiring isn't easy), training, etc, etc.


It's not hard to hire people to write PR stuff. It's just unemployed English majors, of whom there are legions. There are no associated costs, as the positions have no benefits and everyone works from home.


...and the interns get to say "I've written for The New York Times" the next time they interview for a job. Everybody wins, as The Economist argued: http://www.economist.com/node/21528449


I've been told by PR people that 80% of the content in a newspaper or a magazine is written by PR agencies.

Then again, PR people have a strong commercial interest in making it appear so that only stories pitched by PR people get published.

My (few) experiences in getting stories through have been to the contrary, usually just journalists picking up something I blogged. An exotic example: http://bergie.iki.fi/blog/on_usb_fingers_and_world_news/


There was a similar type of NYT article written about StatSheet last November that had the right amount of skepticism and optimism: http://www.nytimes.com/2010/11/28/business/28digi.html

Automating long-form narrative content is a relatively new concept. It's not perfect (yet) so I appreciate the skeptics.


My guess is that in this particular case, since two of the leadership (CTO and CSA) are journalism professors, they didn't even need a PR firm to do this. It's obviously possible to contact journalists directly (I've done that and landed on the cover of the WSJ, NYT, etc.).


I was wondering why a newspaper would so positively cover a company trying to outsource reporters' jobs to machines. "In Case You're Wondering, A Real PR Person Wrote This Column."


Same thing that happened with coverage of the Internet back in the '90s. And even if it wasn't obvious back then what would happen to newspapers, it certainly was a few years later, when it was already happening.

Reporters are usually liberal. And they come from a time where newspapers were profitable and could be idealistic and still make money. They could answer to a higher standard because they didn't have to worry about losing their jobs for money reasons.

As for right now, the ship is going down, but they still don't see the threat.


That's a jaded and misguided view of the world. Very few reporters will accept articles written by others, and those who do predominantly work for trade rags, not the New York Times.


Please tell me you read that article and didn't find it odd how extremely one-sided and positive it was.

Also, it happens all the time. And with a publishing machine as big as the NYT, it's bound to happen there as well. Remember Jayson Blair? Different symptom, same cause.


There's a difference between PG's submarine articles and the PR person literally writing the article.


Oh, sure, maybe they don't just accept articles written by others. But they will base things heavily off press releases. That's what press releases are there for. If reporters didn't bite, nobody in the big leagues would bother with them.


It's not just newspapers that push PR pieces. This also applies to the web (TechCrunch and Wired come to mind).


I love seeing these examples of product development: begin with a very specific niche at the edge (not tackling the mainstream head-on) and "target non-consumption". That way, you have no competition; it's not a zero-sum game where you beat someone, but one where you create value that never existed before. This is possible not because the product is good, but because it's cheap (and good enough):

> primarily a low-cost tool ... for local youth sports .... and financial results of local public companies ... “Mostly, we’re doing things that are not being done otherwise,”

Then, once you have some customers - any customers! - you improve it, bit by bit. It doesn't need to be perfect in the first place; it doesn't need to be perfect in the end. It just needs to be good enough to be useful.

> [customer] worked with Narrative Science for months to fine-tune the software

As for the technology itself, we're not told anything of its details, just what it can do. This is a marketing article, not a tech report. It would be interesting to see the models they use for stories, and whether they use grammars for the overall structure. These are very narrow domains, which are the easiest to start with: you could enumerate all the standard cliches, understand when they apply, and tweak the model. That's where the journalistic domain expertise of the two founders would come in handy. BTW: "easiest" is only relative - it would still be very difficult (almost impossible), and kudos to these guys for actually doing it - and even better, making an actual business out of it.

It reads like a '50s Asimov story: the future is finally arriving.

But a Pulitzer in 5 years is absurd, either cynical puff or visionary bravado. Theoretically possible, I think, maybe in 50 years - the figure I've long given for strong AI. ;-)


Curious, what do you mean by "target non-consumption"?


It means target people with needs that aren't being met. Those people are not consuming (i.e. buying and using) a product or service to address their problem.

It's clearest to see when a product exists to solve a problem, but it's too expensive for some people (or some situation). In the article, the problem of reporting on local sports/financials has a solution (reporters), but the value of that news isn't worth their time: their time is too expensive. So the newspaper doesn't "consume" a solution to the problem of reporting that particular news.

By targeting this non-consumption, the startup doesn't compete against reporters (yet...), so it provokes no desperate fight for survival.

It's a term from Clayton Christensen, who wrote The Innovator's Dilemma, though he doesn't use it until "The Innovator's Solution", and expands on it in "Seeing What's Next".


Did they write an entire two-page article while ignoring the real leader in this space, in my opinion: http://statsheet.com/


A computer never would have missed that.


My worry here is that computers will learn to write articles tailored to every individual. The computer will know which articles we liked and which we didn't, and will simply write what we want to read. If the computers are giving us what we want to read, it becomes even less likely that we'll hear an opposing view to our own.


We don't need computers to do this; people do it already. Most "news" channels spew out such a monotonous train of thought that it makes me wonder if switching to a news channel these days is even worth it. For now, I am able to get my news and information online from a variety of sources that offer me interesting views on both sides. Even if one day all of this content is produced by machines, I think it's the individual's responsibility to make sure that he/she gets information from both sides.


If a computer had written this article, maybe it would have mentioned how useful this technology is for spammers.


I've said it before: sufficiently advanced spam is indistinguishable from content. If this comes to pass, I'm not sure who will have won the spam wars, but I'm inclined to say that the low-end content authors will have lost.



I'm skeptical of the claim that a program could win a Pulitzer. How does it decide what to write about, who to interview, and what questions to ask?

Reporting a day at the races or the markets is easy because we know which kinds of data are relevant and we have them available.


It's not inconceivable that, as AI advances, at some point there will be algorithms that figure out the questions you have posed.


True, it's not inconceivable, but arguably having a system that is an effective investigative journalist is AI-Complete:

http://en.wikipedia.org/wiki/AI-complete


This is true, but like most things to do with AI, it has limitations.

In some fields (e.g., finance) one can conceive of a computer-based process that would do a better job than most investigative journalists. It can't deal with missing data, but it can discover inaccuracies, unusual events, and suspicious patterns, and in some limited fields this is enough.

For example, an AI-based process might have been just as good at finding the problems at Enron as conventional journalists were (since the problems there were mostly uncovered by forensic accounting on the company's public balance sheets):

But hard information was scarce. "It's almost as if you have to use forensic accountants when you're doing a company story because many companies are using very aggressive accounting techniques that are perfectly legal," Shepard says.

http://www.washingtonpost.com/wp-dyn/articles/A64769-2002Jan...


I wonder if these automatically generated articles will ever become good enough to be worth reading. Currently, they seem to be just good enough to fool Google, and convince people to click the link. Do any sports fans bookmark and come back to these sites?

No matter how good the algorithms get, they are still limited by their input, the statistics. If, for example, a player scores a very unusual goal, say a bicycle kick in soccer, then a real writer who actually saw the match would surely mention it. An algorithm could not, if there is no field for "unusual goal" in the match statistics.
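A minimal sketch of that limitation, with invented field names: the generator below can only narrate what its statistics dictionary contains, so an unrecorded bicycle kick can never surface in its prose.

```python
# Toy recap generator: each phrasing rule is keyed to a stats field.
# If an event (like a bicycle kick) has no field, no rule can mention it.
def recap(stats):
    lines = [f"{stats['winner']} beat {stats['loser']} {stats['score']}."]
    if "top_scorer" in stats:  # optional field -> optional sentence
        lines.append(f"{stats['top_scorer']} led the scoring.")
    return " ".join(lines)

print(recap({"winner": "United", "loser": "City", "score": "2-1",
             "top_scorer": "Rooney"}))
```

The output is only ever as rich as the input schema; a human watching the match carries an unbounded schema.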


I think I'd rather see more innovative ways of representing the history of a sporting event graphically instead of trying to replicate the relatively inane blow-by-blow that you currently get from real sports journalists.

Maybe, though, if this kind of thing continues to improve, humans will have to start doing more serious analysis instead of fluff coverage.


Still a long way to go, but the content is getting better. Here is a recent article from StatSheet: http://bosoxball.com/boston-red-sox/game-recap/red-sox-defea...


Here's a description of my venture into this territory, in which I generated formulaic lottery result briefs:

"I wrote this article with one mouse click"

http://coding.pressbin.com/60/I-wrote-this-article-with-one-...

I can't imagine the sort of code base that would be needed to make these stories not seem formulaic.



Page One [1] is a browser extension that automatically redirects you to the single-page version of articles on popular news sites, including NYT. Works great for me.

[1] http://globalmoxie.com/blog/page-one-safari-chrome-extension...


But doesn't work with Firefox.


ObXKCD: http://xkcd.com/904/

There are certain topical areas which lend themselves to automated content generation. Sports, financial news, weather, astronomy (astrology isn't worth mentioning), earthquakes and other severe events, machine monitoring.

These are domains in which a quantified or measured outcome occurs, tied to a specific point in time or event (final score, market close, daily forecast, etc.). The important data has already been highlighted; all you've got to do is sprinkle some syntactic sugar around it.

Oddly enough, these are areas in which you're already most likely to find existing "AI"-type content generators.

In areas in which you've got to do significant determination of what is salient, the approach isn't nearly as successful.


This is a recent email I got from Facebook's support team regarding a vanity URL for my business. I could swear this guy is a robot or a script, and I wonder if Facebook is using the technology described in the article:

----------------

We’re sorry, but we’re unable to process your request because another entity has made a previous request concerning this username. If you are still interested in claiming the username, you may contact us in 60 days for an update about its availability.

---

You have reached the right channel for these requests. As mentioned earlier, we have no further information to share with you concerning the username "xxxx" (marked out). We will be unable to assist you further from this alias.

----------------

What human being talks like that?


A human who had to write several dozen support answers?

Naming collisions have to be a common occurrence for Facebook. It’s sufficient to write exactly one mail for such cases. There is no need to re-write or change things around, it’s always the same answer to the same question.

Using a robot to write stuff like that seems wasteful – I don’t even think it would currently be possible.


It's probably a template. Someone read it, and picked "template-name-conflict-XYZ" to respond with.


I suppose this may do for articles that just deliver some facts. However, the kind of stuff I enjoy reading doesn't just barf up facts in the form of sentences; it provides insight into what the implications of those facts may be and also draws from the past to better put things in context.

That's not to say their technology couldn't be improved to search the web and see what past events are relevant, but providing good insights about the implications of the facts will be a whole lot tougher. I don't think journalists need to be shaking in their boots unless they only deliver the quality and depth of results that this algorithm delivers.


These technological advances make me shudder about potential job losses in the future, even though previous technological advances created new jobs.

Sure, there's no way that my profession, and the great majority of jobs on the internet, would be possible if we still relied on human switchboard operators rather than automation. But that doesn't mean it will be true for the next advances in technology, does it?


If the cost of producing even the sort of dry, statistics-heavy content the program presently excels at were a primary factor, then we'd have outsourced it to India or the Philippines by now. You'd certainly pay less than $10 for an article like this: http://www.builderonline.com/local-housing-data/new-england/...

I'm willing to believe the underlying machine-learning technology is very clever, but I'm also willing to believe a specialised toy script could produce similar results, even if you had to hard-code the minimum winning margin for a "rout".
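The hard-coded threshold imagined above might look like this toy sketch; the cutoffs and phrasing are invented, not anyone's actual rules.

```python
# Toy vocabulary rule: choose the verb for a game recap from the
# winning margin alone. Thresholds are arbitrary, for illustration.
def verb(margin):
    if margin >= 20:
        return "routed"
    if margin >= 10:
        return "beat handily"
    return "edged"

print(f"Team A {verb(34)} Team B")
print(f"Team C {verb(3)} Team D")
```

A handful of such rules plus a stats feed gets you surprisingly far in a narrow domain, which is the commenter's point.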

As for the Freakonomics comparison, they seem to have missed the appeal of Levitt: his ability to posit a plausible causal relationship between two apparently unrelated variables. Any idiot can summarise "remarkable findings" based on spurious correlations.


Fun idea: produce similar autogenerated narratives on commit activity of various open source projects

Bergie had a strong start to the office day, closing four bugs in a row. Then luck turned and he broke the build...
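A toy sketch of this idea, with an invented commit-log format and phrasing rules:

```python
# Narrate a day's commits. The log schema ("author", "type") and the
# sentence templates are made up for illustration.
def narrate(commits):
    """Turn a day's commit log into a one-line story."""
    fixes = [c for c in commits if c["type"] == "fix"]
    breaks = [c for c in commits if c["type"] == "break"]
    author = commits[0]["author"]
    story = f"{author} had a strong start, closing {len(fixes)} bugs in a row."
    if breaks:
        story += " Then luck turned and the build broke."
    return story

log = [{"author": "Bergie", "type": "fix"}] * 4 + \
      [{"author": "Bergie", "type": "break"}]
print(narrate(log))
```

A real version would pull from `git log` and need far more phrasing variants to avoid sounding canned, which is exactly the hard part discussed elsewhere in the thread.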


The problem I have with that line of thinking is it encourages halting the progress of technology based on fear of the unknown, a classic human foible.


This has been a common fear in the face of pretty much every new technology. All past holders of this view have been proven incorrect. Do you want to bet that today (or tomorrow, or next month) is the first time in all of history that it will be correct?


This is pretty fascinating stuff, despite the limitations and obvious bias of this article. Are there any Open Source libraries or papers which cover toy implementations of this sort of thing? (Assuming, of course, that it is not simply a bunch of if/else constructs applied to templates, which would be far less interesting.)


This reminds me of what MarketBrief is doing for financial documents. Definitely less color / variance in the stories though.

http://techcrunch.com/2011/08/15/yc-funded-marketbrief-makes...


For those interested, the best source of research in this field is the "Special Interest Group on Natural Language Generation": http://www.aclweb.org/anthology/siggen.html


If this works as advertised, it would have important (bad) consequences for SEO, right?


Making a note here: add some shiny around my Python template engine and I can land $6m investment.



