Hacker News
Journalism generated by machine is on the rise (nytimes.com)
123 points by pseudolus 11 days ago | 59 comments





I'm a former journalist and spent the first 10 years of my career at various magazines and newspapers.

About eight years ago, I landed my first full-time software development job in part because of a walkthrough I gave during my in-person interview process of a tool I'd developed while working for a metro daily newspaper. It wrote lottery results stories with the click of a button.

Writing those results stories was a daily chore. We would look up the state's daily lottery results and also report Powerball and Mega Millions results. For those larger lotteries, we'd also adjust our headlines depending on whether there was a big winner in our state. But aside from that, it was just tedious and formulaic, so I wrote a script that would hit the various lottery sites, scrape the data, and generate a two- or three-paragraph story that was ready to post. It took the task down from five minutes to five seconds. (I later worked on a similar tool for generating short weather reports and alerts.)
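A minimal sketch of the kind of template-driven generator described above, assuming the scraping step has already produced a structured record. The `Drawing` fields, the function names, and the wording of the templates are all hypothetical; the original script is not shown in the comment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Drawing:
    """One lottery drawing; every field name here is hypothetical."""
    game: str
    date: str
    numbers: List[int]
    jackpot: int                         # advertised jackpot in dollars
    winner_state: Optional[str] = None   # None means nobody hit the jackpot


def headline(d: Drawing, home_state: str) -> str:
    # The headline changes only when a jackpot winner is in the home state.
    if d.winner_state == home_state:
        return f"{home_state} ticket wins ${d.jackpot:,} {d.game} jackpot"
    return f"{d.game} numbers for {d.date}"


def story(d: Drawing, home_state: str) -> str:
    nums = ", ".join(str(n) for n in d.numbers)
    paragraphs = [f"The winning {d.game} numbers drawn {d.date} were {nums}."]
    if d.winner_state is None:
        paragraphs.append(
            f"No ticket matched all the numbers for the ${d.jackpot:,} jackpot."
        )
    else:
        paragraphs.append(
            f"A ticket sold in {d.winner_state} won the ${d.jackpot:,} jackpot."
        )
    # Headline first, then the body paragraphs separated by blank lines.
    return headline(d, home_state) + "\n\n" + "\n\n".join(paragraphs)
```

Calling `story(drawing, "Ohio")` on a record with `winner_state="Ohio"` would produce the "big winner in our state" headline variant; any other record falls back to the plain results headline.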

Since then, I've worked on other NLG stuff, and honestly, it's pretty hard, and the topic really has to be data-driven to begin with; we're probably decades away from really insightful NLG that offers "why" explanations, as opposed to "what" or "how" descriptions.

The other tough challenge about computer-generated journalism is that getting the data is often the hardest part of the process. Oh, sure, businesses are going to release their quarterly earnings reports, and sports teams are going to release their game data. You might even use computer-assisted reporting to generate FOIA requests for data that governments are required by law to release upon request. But you're not going to write a story that leads to Nixon's resignation or Enron's collapse simply by asking for the data.


> we're probably decades away from really insightful NLG that offers "why" explanations, as opposed to "what" or "how" descriptions.

For any specialized field, doing this would require something more than the kinds of schooling journalists get. It would also step on a lot of political toes: For example, the current national debt is $X. Why is it that big? Is that number a problem? Say anything concrete regarding either of those questions and you get screeching. Just endless, wordless screeching, like the Pod People in Invasion of the Body Snatchers, only directed into your Letters to the Editor column.


To be fair to the journalists, those are actually open research questions.

> To be fair to the journalists, those are actually open research questions.

There are strong consensuses about some aspects of those problems, however, especially regarding the benefits of using government debt to finance stimulus programs as part of a counter-cyclical action against a recession. Saying it's all unknown is about as intellectually honest as saying that we don't know whether the world is flat or round because we're unsure of its topology at the millimeter scale.


>>There are strong consensuses about some aspects of those problems, however, especially regarding the benefits of using government debt to finance stimulus programs as part of a counter-cyclical action against a recession.

There is no consensus that counter-cyclical stimulus is beneficial, outside of those economists already committed to central economic planning of the money supply. Basic economic theory tells us that spending directed by political forces is likely to be less efficient than spending directed by market forces. Counter-cyclical stimulus uses future earnings of the private sector to pay for current government spending, so assuming all other factors remain equal, the price is less market-directed spending.

Moreover, counter-cyclical stimulus interferes with the development of the natural corrective processes of a free market, because it means those who saved cash during the credit bubble have less opportunity to buy up assets when the bubble pops. By reducing the capital allocated to this group, you reduce their future influence on the economy, which would have the effect of reducing the magnitude of bubbles, and with it, economic volatility.


Economics, like all social sciences, is nowhere near as sound as the hard sciences. You simply cannot have control groups for many experiments, so you are left with case studies, which are often unconvincing. For example, does a true free market really exist anywhere in the world? How about a 100% command economy? How can you honestly determine if the Federal Reserve helps or hurts the economy?

> Economics, like all social sciences, is nowhere near as sound as the hard sciences. You simply cannot have control groups for many experiments, so you are left with case studies, which are often unconvincing.

You can say the same thing about paleontology, but do you seriously doubt the past existence of non-avian dinosaurs?

> For example, does a true free market really exist anywhere in the world? How about a 100% command economy?

You don't need perfect examples of things to observe what happens when economies closely approach those extremes.

> How can you honestly determine if the Federal Reserve helps or hurts the economy?

By observing how horrible the business cycle was prior to its existence.


I see disagreement by downvoting is in effect, which discredits the disagreement.

My side project is a local news outlet for two small cities near me. They are small enough cities that I am the only game in town when it comes to digital media; the only competitor is the local newspaper, which has no online presence. Plenty of our stories are actual news articles about local politics, new businesses and restaurants, and local-interest stuff, all written by humans.

But filling in the gaps is plenty of machine-generated content. One of our big draws is the event calendar, and fortunately events give me enough data to feed an AI that I've developed. There is enough structured data and the information is so routine that this system can churn out several articles per day that sound hand-written, and every article sounds custom-tailored to the business and the event itself. This fills in the routine work of keeping content flowing in between the stuff that takes a lot longer to research and write, and makes it viable for us (one full-time employee and one part-time employee) to run two popular small-town news websites where otherwise no one would even bother trying.

This is also a great case where classical AI is still relevant to modern problems. Neural networks just are not good at writing English in a way that humans would enjoy reading. At some point I plan on packaging it up and doing a Show HN, but for the time being, this article is spot on. Machine-generated news content breathes whole new life into an area that's increasingly hard to turn a profit.


I've been working on something along the same lines for the past couple of years. I would love to exchange some ideas with you!

Send me an email at moura at the domain in my profile


I'd be really interested in your technical approach to this. How did you implement this at a high level?

The core of it is a decision tree tied to a pretty in-depth database that either knows everything about the city or seeks to learn everything about the city. The system reads an event on the calendar that says "Van Halen is playing at American Brewpub on February 12th at 7pm" and starts running through the decision tree, pulling information out of the database to fill in the blanks on the phrases it's picked.

Then it strings them all together to write something that looks something like:

>Do you have plans on Tuesday? Well now you do! Van Halen is playing at American Brewpub at 7pm! Van Halen is a local rock band that we love. They have been to our city before, but they played at Next Door Music Venue (link to the previous article for Van Halen at Next Door). This time they're playing at our favorite brewery, so you can watch the show while drinking Local IPA! Tickets are required, and you can purchase them here (link to tickets). We'll see you there!

People love these articles, but they're not super fun to write every day (sometimes twice a day for the two cities); it's just routine stuff.
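The approach described above can be sketched roughly as follows, assuming a tiny in-memory dictionary in place of the real database. `CITY_DB`, its keys, and `write_event` are hypothetical names, and each `if` stands in for one branch of the decision tree: a phrase is emitted only when the database can fill its blanks.

```python
from datetime import date

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical stand-in for the "in-depth database" described above.
CITY_DB = {
    "acts": {
        "Van Halen": {"genre": "rock", "local": True,
                      "last_venue": "Next Door Music Venue"},
    },
    "venues": {
        "American Brewpub": {"kind": "brewery", "signature": "Local IPA"},
    },
}


def write_event(act: str, venue: str, when: date, time: str, ticket_url: str = "") -> str:
    a = CITY_DB["acts"].get(act, {})
    v = CITY_DB["venues"].get(venue, {})
    parts = [
        f"Do you have plans on {DAYS[when.weekday()]}? Well now you do!",
        f"{act} is playing at {venue} at {time}!",
    ]
    # Each branch adds a phrase only when the database can fill its blanks.
    if a.get("genre"):
        scope = "local " if a.get("local") else ""
        parts.append(f"{act} is a {scope}{a['genre']} band.")
    if a.get("last_venue") and a["last_venue"] != venue:
        parts.append(
            f"They have been to our city before, but they played at {a['last_venue']}."
        )
    if v.get("kind") == "brewery" and v.get("signature"):
        parts.append(f"You can watch the show while drinking {v['signature']}!")
    if ticket_url:
        parts.append(f"Tickets are required, and you can purchase them here: {ticket_url}")
    return " ".join(parts)
```

An act or venue missing from the dictionary simply yields a shorter article, which is one way such a system could avoid asserting things it has no data for.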


The concept seems pretty solid, but reading the output I feel like this is bordering on unethical. Maybe this is just a bad example, but it seems you’ve gone past programmatically giving people information, and into giving the impression you’re actually endorsing things. Unless you actually add metadata to the artist saying you love them, or rate the venue to indicate it’s good for live music, you’ve got no way of going beyond “Van Halen, a local rock band, are playing at American Brewpub at 7pm Thursday. You can drink local IPA while watching them, and buy tickets here.”

This is a very simplified example written by human hands just for this post (not machine written). The database is immense and has far more fields than you might imagine. It takes into account how many clicks content with Van Halen has had in the past, how many clicks content with American Brewpub has had, etc., to figure out how likely this article is to resonate with our audience. It also counts how many human-written articles have been published about these subjects and runs sentiment analysis to figure out whether our articles were positive or neutral (we don't write negative articles; if we don't like a business or event we just don't talk about it). All of that impacts how the machine-generated article is phrased, or even whether it gets written at all.

We have about 90 events on our calendar per city in any given month, and this AI writes about three or four articles per city per week even though it's running constantly. It's pretty selective and very rarely writes something that our two human writers wouldn't have mentioned otherwise. Our personal preferences and those of our audience are certainly taken into account.
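A rough sketch of that kind of gatekeeping, under the assumption that click counts, human-written article counts, and an average sentiment score are available per subject. The `Subject` fields, the thresholds, and both function names are illustrative guesses, not the actual system.

```python
from dataclasses import dataclass


@dataclass
class Subject:
    """Engagement history for an act or venue; all fields are hypothetical."""
    past_clicks: int       # clicks on earlier content mentioning it
    human_articles: int    # human-written articles already published about it
    avg_sentiment: float   # -1.0 (negative) to 1.0 (positive)


def should_write(act: Subject, venue: Subject, min_clicks: int = 500) -> bool:
    # Never auto-write about subjects the human writers covered negatively.
    if act.avg_sentiment < 0 or venue.avg_sentiment < 0:
        return False
    # Require some sign that the human writers would have mentioned it anyway.
    if act.human_articles + venue.human_articles == 0:
        return False
    return act.past_clicks + venue.past_clicks >= min_clicks


def tone(act: Subject, venue: Subject) -> str:
    # Endorsing language only when past human coverage was clearly positive.
    if min(act.avg_sentiment, venue.avg_sentiment) > 0.5:
        return "enthusiastic"
    return "neutral"
```

With thresholds like these, most calendar events would be skipped, which is consistent with the comment's figure of a few articles per week out of roughly 90 events per month.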


That’s incredibly cool, and definitely covers my concerns there. Good work on giving it some restraint rather than going for sheer volume as well.

Maybe the AI also considers ratings of the band. Then it would be no more unethical than online shops displaying customer satisfaction ratings?

In any case the AI could make several educated guesses on how good the band is. Quality of the venue would be another indicator.


> The program can dissect a financial report the moment it appears and spit out an immediate news story that includes the most pertinent facts and figures.

As an accountant in a prior life, I can tell you that this approach won't provide anything near "the most pertinent facts". That's because public earnings reports are written specifically to circumvent automated analysis. Wall Street firms have tools in place to scrape the data tables from those PDFs, and even then the technology isn't perfect, because layouts aren't standardized (merged cells, tables built in InDesign, etc.).

At best, it will get the headline numbers right, which are meaningless without context. What does a quarterly profit of $3.3m for MSTR mean without benchmarking against its competitors, or taking the macroeconomic environment (interest rates, etc) into account?

The headline numbers on the balance sheet, P&L and cash flow statement don't say nearly as much as the notes to those statements, which often contain minute details that are extremely relevant for investors and analysts. Trained accountants and analysts can miss details when reading through those, so how is AI going to parse it any better?

Unfortunately, this smacks of "quantity of articles" over quality of analysis, the latter of which represents why journalism is valuable and necessary.


The NLG companies aren't parsing the data from financial reports. They're getting clean, structured data (and context) from specialized data providers like Zacks, and its equivalent in other industries: https://automatedinsights.com/customer-stories/associated-pr...

==What does a quarterly profit of $3.3m for MSTR mean without benchmarking against its competitors, or taking the macroeconomic environment (interest rates, etc) into account?==

This doesn't seem that difficult to implement into a story. There are 125 earnings reports today [1]; a retelling of the headline numbers with competitor analysis should fit the bill for reporting on most of them.

==Trained accountants and analysts can miss details when reading through those, so how is AI going to parse it any better?==

How could we expect a journalist to parse it better? Is that really the goal of news on financial reports?

[1] https://finance.yahoo.com/calendar/earnings/
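As a sketch of what "headline numbers with competitor analysis" could look like in code, the function below compares one company's quarterly profit against a set of peers. The function name, the peer tickers used in any example call, and the sentence wording are assumptions for illustration; the data is expected to come pre-structured from a provider, as described above.

```python
from statistics import median


def earnings_blurb(ticker: str, profit_m: float, peers: dict) -> str:
    """peers maps ticker -> quarterly profit in $ millions (illustrative inputs)."""
    base = f"{ticker} reported a quarterly profit of ${profit_m:.1f}m"
    if not peers:
        # No benchmark available: fall back to the bare headline number.
        return base + "."
    mid = median(peers.values())
    beaten = sum(1 for p in peers.values() if profit_m > p)
    if profit_m > mid:
        side = "above"
    elif profit_m < mid:
        side = "below"
    else:
        side = "in line with"
    return (f"{base}, {side} the peer median of ${mid:.1f}m "
            f"and ahead of {beaten} of {len(peers)} tracked competitors.")
```

This only benchmarks a single headline figure; it does nothing about the notes to the financial statements that the parent comment is concerned with.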


Wall Street firms get financial feeds from Zacks, EDGAR, Morningstar and a whole slew of other financial data providers. Finance sites like Yahoo Finance also get their data from these sources. They don't have to scrape anything for data. It comes in structured CSV, XML, etc. formats already.

Also, journalists are supposed to provide the "facts", not analysis. They aren't financial experts so even if they provided analysis, I wouldn't put much stock in them.


> Wall Street firms get financial feeds from Zacks, EDGAR, Morningstar and a whole slew of other financial data providers. Finance sites like Yahoo Finance also get their data from these sources. They don't have to scrape anything for data. It comes in structured CSV, XML, etc. formats already.

Data quality varies from vendor to vendor. Additionally, speed is a factor in how profitably some strategies can be executed. When firms are examining bits on the wire to guess whether earnings were good or not (before the full headline arrives), you can’t necessarily wait for the vendors to update their releases, especially since all of your competitors will have exactly the same data.


Check out the Excel version of the 10-Q report for Tesla, generated by Morningstar Research software [1]. Lots of merged cells, with some column labels appearing one column adjacent from the data it's describing. And you still need the HTML/PDF version because the Excel includes only the numbers, and not things like risk disclosure or management's comments about the earnings.

As 'structured data' goes, it's nowhere near where it needs to be to support 'instant article generation' beyond the shallowest headline numbers. The consolidated financial statements all have appendices (notes) with relevant details. The case study you linked to indicates only that they reduced man-hours significantly by automating the process of manually picking numbers and putting them into an article template. They may as well be putting out a press release.

Although the topic of the post is how automation affects financial journalism, it bears mentioning that an analyst's job is to reverse engineer the report to see how the company arrived at those numbers. The vendors' auto-generated reports never include formulas, so you'll have to do your own calculations as part of your due diligence.

For instance, if they're trading at a high P/E ratio, how much of that is due to positive investor sentiment and not related to recent buybacks? The headline numbers won't reveal that, but past data and the notes to the financial statements usually will.

If their cash balance says $x billion, how much of that came from convertible bond issues that are coming due in the next 12 months?

[1]http://ir.tesla.com/sec-filings?field_nir_sec_form_group_tar...
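To illustrate why merged cells and offset labels get in the way of automation, here is a small sketch of the cleanup a consumer of such a spreadsheet might need. It assumes the sheet has already been read into plain Python lists, with merged header cells showing up as a label followed by blanks; the function names and the sample layout are hypothetical and are not taken from the actual Tesla file.

```python
def fill_merged_headers(header_row):
    """Repeat the last seen label across the blank cells a merged header leaves behind."""
    filled, last = [], None
    for cell in header_row:
        if cell not in (None, ""):
            last = cell
        filled.append(last)
    return filled


def to_records(header_row, data_rows):
    """Attach each non-blank value to the label that covers its column."""
    labels = fill_merged_headers(header_row)
    records = []
    for row in data_rows:
        rec = {}
        for label, value in zip(labels, row):
            if value in (None, ""):
                continue  # skip the padding cells created by the merge
            rec.setdefault(label, []).append(value)
        records.append(rec)
    return records
```

Even with this kind of repair, the result covers only the numeric tables; the risk disclosures and management commentary mentioned above still live in the HTML/PDF version.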


^ can confirm the above assertion about in-report tables being created as vector graphics to stymie scraping efforts

Just like HackerNews comments.

Somewhat seriously, I know someone out there is posting AI generated HN comments and testing it here. With the proper timing/rates/etc... it wouldn't be hard to avoid easy detection. I don't have specific accounts in mind, but I have a hard time believing no-one is trying it out (given the overlap of HN with AI enthusiasts).

So the real question is: can anyone detect the AI comments?


The thing of it is I wonder if we, really, have a distinction.[b]

As this[1] article puts it: 'And it also illustrated how much people tend to anthropomorphize AI, believing that it has deep-seated beliefs rather than seeing it as a statistical machine.'

But, really, have we proved there is anything to such romantic or spiritual notions about human beings, or are we just 'statistical machines'?

Anyway, my test for an AI-generated comment: determine a measure for 'sensicalness', the higher the score (aka, more sensical) the higher the probability of non-human origin.

[1]https://www.technologyreview.com/s/610634/microsofts-neo-naz...

[b] Really, I think our definition of human comes not from the mind but from the body. That's always the definition we deploy, whether it's for or against racism or abortion or...anything else, really. Even a brain in a jar is still defined in terms of being a brain in a jar. This is probably why the internet, as it was known before, will go away and cede to the 'video-sphere': when we invented writing (such as this message), we divorced the content from human embodiment, so we could never be sure, even all the way back then, if we were looking at something composed by man or gods or...anything.


>As this[1] article puts it: 'And it also illustrated how much people tend to anthropomorphize AI, believing that it has deep-seated beliefs rather than seeing it as a statistical machine.'

I would go in exactly the opposite direction. AI does have deep-seated beliefs because the programmers who input the training data and label it have deep-seated beliefs, as does the culture the content is drawn from. I'd say it's much more likely that AI is more human than philosophically ignorant scientists obsessed with mechanistic empiricist dogma would let on than it is that humans are just 'statistical machines'.

For instance, AI identifying some women as men (and some men as women) shows that it's just as human as the rest of us - it was trained on data based on squarely modernist gender appearances.

This is a good article that touches on the issue: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3078224


Interesting! Please tell me more about the can anyone detect the AI comments?

Naw, doesn't work. There aren't enough HN posts that declare something interesting and ask for more information. You need an AI that explains why "anyone can detect the AI comments" is a broken idea that will never work and also sucks.

Does humor signal humanity, or does a reference to a popular line from a show famous in nerd culture show lack of creativity?

No matter the answer, I was entertained.


Not aware of the show you speak of, was just parroting poorly written chatbots.

As for humor, that's one thing AI is not going to be able to do well for many years to come because it requires too much creativity. But as a Turing test it's not very good - some people are just fundamentally unfunny.


Check out Reddit's Subreddit Simulator (https://www.reddit.com/r/SubredditSimulator/). It's a fully-automated subreddit where only bots can post, and everyone makes their own bot to post automatically generated comments.

The thing is, the majority of what humans consume [tv procedurals, romance/SF/fantasy novels, superhero comics, news, etc, ad nauseam] is so tightly structured that you don't even need an AI to generate it; even just a Markov generator will do it most of the time (someone had a project that generated quotes indistinguishable from 50 Shades of Grey). They'd already written screenplays on computers at MIT in the 60s. But even the high-end isn't saved; if Burroughs were alive today and in his 20s-30s, he'd probably be using computers for a modern generated 'cut-up' technique. Really, Dada was just ahead of the curve by a century.

You're not going to get a (remotely decent) novel out of a Markov chain text generator unless you're using tuples so long that you're just regurgitating an existing novel.
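The trade-off in tuple length can be seen in a minimal word-level Markov chain generator. This is a generic sketch, not any particular project mentioned in the thread; `build_chain`, `generate`, and `branching` are names chosen here for illustration.

```python
import random
from collections import defaultdict


def build_chain(words, order):
    """Map each tuple of `order` consecutive words to the words seen after it."""
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain


def generate(chain, order, length, seed=0):
    """Walk the chain from a random starting state for up to `length` words."""
    rng = random.Random(seed)
    state = rng.choice(sorted(chain))
    out = list(state)
    while len(out) < length:
        followers = chain.get(tuple(out[-order:]))
        if not followers:  # dead end: the corpus never continues this tuple
            break
        out.append(rng.choice(followers))
    return " ".join(out)


def branching(chain):
    """Share of states that have more than one distinct possible next word."""
    return sum(len(set(f)) > 1 for f in chain.values()) / len(chain)
```

As `order` grows, `branching` drops toward zero for any fixed corpus: most tuples have been seen only once, so the walk has no choices left and the output replays the source text, which is the regurgitation problem described above.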

Not Markov chain, but I have been wondering about Dwarf Fortress. Some stories of Dwarf Fortress sound quite interesting. So I wonder if an automated writeup of an automated Dwarf Fortress game could yield something worthwhile.

>You're not going to get a (remotely decent) novel

An actual implementation of such would be wonky?[1] Yes. But decent is a relative word. People are quite willing to put up with a lot of issues to get what they want. (The churn of quick-shift material in the self-publishing world is enough to demonstrate this.)

But, in relation to the nature of text vs video as I mentioned in another comment, the integrity of the written word probably doesn't matter. If video dominates, the words needed in a script need be no more than a generalized layout of plot-points filled with a bit of ad lib and improvisation. (Given the nature of so much 'reality tv' you don't even need that, simply impress the images and arrange them to the pre-defined consumer-expected structure.)

We really overestimate the importance of the novel. Tweets could be generated easily (and are), and such snippets, I'll argue, are the most-consumed scriptorial content. And visual media predominates. The novels that have wide-ranging effects are things like Dan Brown, which are structured in such a way that they are much easier to systematize than would be suspected (as most people will admit, he's really good at writing the same book over and over again).

And the people who care about 'remotely decent novels' beyond their own engagement are few and far between. They have no mainstream cultural value (in the US) beyond shock value.

[This is not to be taken as me knocking novels. Also, I may be too US-centric, but I doubt the rest of humanity is any less dominated by the arresting nature of the visual, or, again, they interact heavily through messaging apps that count in the snippet category.]

>regurgitating an existing novel

That's what most media is. (Which is not to be taken as a slam; a given culture has to repeat or it's not a culture.)

[1] Am I going to absolutely bet that it would work? No. But I've done some work in this area, and I still will contend the distance between theory and praxis is much smaller than anyone wants to admit. The tooling would be much simpler than a full-blown AI.


Is AI the death of human creativity?

Assisted AI apparently now designs our cars and our planes, designs our logos, designs our buildings, generates our music.

More significantly, un-assisted AI curates what we are exposed to in the form of what's shown to us in our feeds and suggestions, and judging by the articles in OP, already is capable of writing articles for us. AI voices and news casters seem to be getting better by the day.

I can't tell where this is going anymore.


I'd argue that this type of AI frees up humans to be more creative. If a journalist doesn't have to spend time writing routine articles about baseball scores, they have more time to spend tracking down real stories. I posted elsewhere in these comments about my own AI-powered news system, and if I didn't have that my entire day would be taken up writing routine articles that are interesting to readers but have very little journalistic value. Instead I can turn on the machine and let it write this routine content while I go out and interview the city manager about the proposed development downtown or the upcoming tax increase.

If the routine and mindless tasks of humans are moved to automation, those humans are now free to actually create.


While I do agree with your argument to a point, it breaks down because the internet is a mountain of crap, and while attention is a limited resource, the crap is limitless.

You say you use AI to generate mundane articles. Well, your mundane AI article about baseball and 10,000 other AI generated articles are competing with quality content written by real people.

There are people who are passionate about baseball, who went out to watch that baseball game and write about the passion behind every play and every ball, possibly interviewing the crowd and the players.

Your AI is probably learning from those articles and getting better at faking that passion. So much so that in a few years maybe people won't be able to tell the difference.


In most cases, they are not competing, because the primary niche for NLG is to generate content that was never economical for a human to take the time to create -- minor leagues, fantasy football, etc.

I think we'll see an explosion in AI created works, and then a countermovement to only consume things crafted by humans. Sort of like what we've been seeing with "organic" foods but for music, art, and news.

You will also (and already do) see deliberate efforts to disconnect from the artificial, virtual world and reconnect with the physical world: unplug retreats, etc.

This is like the argument that photographs are the death of art. Turns out they didn't even kill figurative art.

I would argue that abstract painting became popular about the time photography became somewhat common.

Photographs took up the mantle of depicting reality and painting became more about visual expression of feelings, perhaps.


I was under the impression almost anything that is trigger-based and has some data points is automated these days. Severe weather, sports statistics, poll results, finance numbers. All of that is fairly automatable. They were talking about this 10 years ago. I'm not sure how to automate much of the rest, since the good stories require some kind of context and talking to people in the field.

I just hope we don't lose too many reporters who go to city council meetings and court hearings. Such places need a human present who understands the context of what is happening and can spot corruption as it occurs. The dark side of automation is where people figure out ways to game the system.


> Such places need a human present who understands the context of what is happening and can spot corruption as it occurs.

We have some interesting AI projects going for spotting corruption as it occurs here in Brazil, both government-backed and fully private ones.


It's worth asking whether "thousands of articles on company earnings reports each quarter" constitutes journalism.

I think the obvious answer is "if it can be done by a machine today, it's not journalism". It's stuff that journalists are expected to do, but it's busy work that takes time away from actual journalism.

Yeah I would assume that there are some interesting stories hidden in the earnings reports, but your typical regurgitated press release style "article" is about the least useful way of presenting the data I can think of.

I hope the financial reporters (or whatever they're called) would be free to do more digging into anomalies, and have more help spotting those anomalies, thanks to software.


It's definitely a great question. Even if the articles separately would be considered mere reporting, is the creation of the tool part of the journalistic process?

Another good one might be whether there's an inherent gain in having machine-generated news play by the same rules and constraints as regular human-generated news. Is a traditional news article the correct format for machine-generated news, or are some of the constraints of traditional articles due to human limitations? Basically, which of the rules of traditional articles are due to the creator, and which are there to make it easier for the reader to consume the information?

Is there a gain in creating 1,000 articles on company earnings versus condensing or distributing (roughly) the same information in another manner? Mainly, would the reader be able to gain more from news created in another format or medium?

What happens here? Is it similar to when newspapers moved from print to web, initially staying in almost the same format they had used for a hundred years?

When we talk about tools and intermediate forms in relation to journalism and the journalistic process, there's also the question of transparency. Is only the end product journalism, or is the end product only a medium for the message? For example, The Correspondent has been in the vanguard on this front in traditional journalism, opening up the intermediate forms of the process to the public.

Should we consider, for example, the aforementioned "thousands of articles on company earnings reports from each quarter" as an intermediate form rather than the journalistic end product - just a transparent intermediate form? If so, through that lens, is the article as a format still the best means for sharing this information?


I'd call it reporting rather than journalism but that isn't based on anything other than how I think about the two things. One is a mechanical regurgitation and the other requires some thought to make connections or inferences.

Isn't the elephant in the room here that the stories that can be auto-generated don't have a creative component? They are just reporting some factual event or data. That is exactly analogous to robots that can assemble something on a factory line because all of the steps can be pre-programmed.

"Real" journalists are the ones that go through various facts and other tidbits and realize that there is a story that is deeper than just the facts. They research and tell that story so that the reader can understand it and appreciate why it is important or insightful. That is something that a 'robo-journalist' won't be able to do for a while.

For example, you can easily write a script to publish descriptions of the local high school football game every Friday night. But a robot doing that won't recognize when one of the team's members is exceptional, or how a team has changed its tactics to increase its ability to win games, or the impact new facilities have had on the team's performance. Connecting those dots is still outside the realm of the possible for these things.


I'm a fan of Quakebot by the LA Times. The bot even gets a byline: https://www.latimes.com/local/lanow/la-me-earthquakesa-earth...

So lots of effort is spent to collect data, make it machine readable, distill it into a selected set of information (still machine readable?) and then... it is obfuscated again by converting it back into natural language which must be parsed again by humans? What a waste! Couldn't that last step be skipped or put under the user's control? Say, my personal assistant/agent/filter/script could ingest the actual data and act on it on my behalf? Maybe tell me only what is actually relevant for me. Such that I don't have to wade through heaps of fluffy bs all the time.
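A sketch of what such a personal filter could look like for earthquake data, assuming the events arrive as plain dictionaries with `mag`, `lat`, and `lon` keys. That record shape, the thresholds, and the function names are assumptions made for illustration; no particular feed format is implied.

```python
import math


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, using the haversine formula."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))  # 6371 km: mean Earth radius


def relevant(events, home, max_km=100.0, min_mag=3.0):
    """Keep only events close enough and strong enough to matter to this reader."""
    return [
        e for e in events
        if e["mag"] >= min_mag
        and distance_km(home[0], home[1], e["lat"], e["lon"]) <= max_km
    ]
```

The reader, not the publisher, picks `max_km` and `min_mag` here, which is the "under the user's control" part of the suggestion.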

> “I hope we’ll see A.I. tools become a productivity tool in the practice of reporting and finding clues,” said Hilary Mason, the general manager for machine learning at Cloudera, a data management software company. “When you do data analysis, you can see anomalies and patterns using A.I. And a human journalist is the right person to understand and figure out.”

While there's been a lot of great stuff happening on the front of machine-generated news for a while now, data analysis is definitely another great target, with perhaps some more immediate gains, when it comes to AI / general automation in journalism.
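As a very small example of the data-analysis side, the sketch below flags values that sit far from the rest of a series. It is a plain z-score check rather than anything that would normally be called A.I., and the threshold and function name are arbitrary choices made here; the point is only that software can surface the anomaly while a human decides whether it is a story.

```python
from statistics import mean, stdev


def flag_anomalies(series, threshold=2.0):
    """Return (index, value, z) for points more than `threshold` sample
    standard deviations away from the mean of the series."""
    if len(series) < 3:
        return []  # too little data to say anything
    mu, sigma = mean(series), stdev(series)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [
        (i, x, (x - mu) / sigma)
        for i, x in enumerate(series)
        if abs(x - mu) / sigma > threshold
    ]
```

A single extreme value inflates the standard deviation it is measured against, so a more careful tool would likely use a robust statistic such as the median absolute deviation.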


Interesting timing when you consider the layoffs and AI-assisted content. It's like the people that lost their jobs were replaced, so really the only change ahead for many firms is content strategy, which is what will drive traffic.

Yet, the elephant in the room is the fact that more and more attention span is held by just a handful of platforms. If economic trends are any indication, the coming 'winter' will be rough for the many firms that are competing for the fewer dollars that aren't going to Facebook or Google.

Interesting times ahead...


If you've ever actually read these financial articles you would know that they are basically hot garbage. Usually someone just wrote a template that scrapes a bunch of numbers out of a financial report and puts them in predetermined places. It has no utility beyond reading the financial report itself aside from being marginally more accessible.

What's an example of the most impressive article written by a bot, in your opinion (fellow HN reader)?

It would be really ironic if this article was machine generated.

Sounds like they finally learned to code


