
In Sweden, Sverker Johansson and His 'Bot' Have Created 2.7M Wikipedia Articles - ilamont
http://online.wsj.com/news/article_email/for-this-author-10-000-wikipedia-articles-is-a-good-days-work-1405305001-lMyQjAxMTA0MDEwMzExNDMyWj
======
dkhar
I'm not very in tune with Wikipedia's culture (which, I've read, is very
nuanced and rigid[1]), but I really don't see why this is a bad thing, given
that the information in the articles is accurate (and the article gives the
impression that glitches are rare).

If nobody else was going to create an article about some species of butterfly,
I don't see why adding that information would be harmful to Wikipedia. Does it
make Wikipedia harder to read? Harder to search?

I don't think "it's not written by a human" is a valid argument for factual
information, and I've never seen any evidence to suggest that it should be
one.

EDIT: I found this bot's edit log!
[https://sv.wikipedia.org/w/index.php?title=Special:Logg/Lsjb...](https://sv.wikipedia.org/w/index.php?title=Special:Logg/Lsjbot)

Here are a few articles randomly picked out of the latest 1000:

[https://sv.wikipedia.org/wiki/Urochloa_plantaginea](https://sv.wikipedia.org/wiki/Urochloa_plantaginea)

[https://sv.wikipedia.org/wiki/Brachiaria_vittata](https://sv.wikipedia.org/wiki/Brachiaria_vittata)

[https://sv.wikipedia.org/wiki/Eutriana_repens](https://sv.wikipedia.org/wiki/Eutriana_repens)

[https://sv.wikipedia.org/wiki/Andropogon_decipiens](https://sv.wikipedia.org/wiki/Andropogon_decipiens)

After looking at these, I'm beginning to see why there is some backlash. There
are literally thousands of articles here that read "X is a species of grass.
It got its name from Y and is described in Z catalog." The only people who
would need this information are botanists, and they already have their own
specialized sources. I'm still not against bot-produced content, but I
understand why some people oppose initiatives like this.

[1]
[http://www.gwern.net/In%20Defense%20Of%20Inclusionism](http://www.gwern.net/In%20Defense%20Of%20Inclusionism)

~~~
gwern
At least part of the problem is that he's generating what one might call 'info
trash': he's taking highly structured information from databases, and turning
it into natural-language prose, a data source of less value since it's less
structured.

These prose versions are now going to steadily fall out of sync with the
original databases, be much more prominent in Wikipedia and Google, diverge
from each other, be harder to parse and perform any complex analysis on (a
database is at least relatively comprehensible, but to parse his dumps you
have to hope you can reverse-engineer them, that no other bots or editors
have modified them much, and that he didn't get clever with his format
strings), etc.
If at some point one wanted to change something about the presentation, it's
no longer a matter of editing one template so that the user-friendly HTML
view onto the database is automatically updated for all viewers; now one has
to run a carefully-written bot on millions of articles (and since that is
beyond semi-automated bots, you have to have special permission to run it).
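
To make that concrete, here's a hypothetical sketch (not Lsjbot's actual
code, which the article doesn't show) of the kind of format-string
generation being described:

    # Hypothetical sketch of the pattern described above, not Lsjbot's
    # actual code. A structured record is flattened into prose; the
    # structure survives only implicitly, in the wording.
    species = {
        "name": "Eutriana repens",       # name from a stub linked above
        "named_after": "<name source>",  # placeholders, not real data
        "catalog": "<catalog>",
    }

    TEMPLATE = ("{name} is a species of grass. It got its name from "
                "{named_after} and is described in {catalog}.")

    print(TEMPLATE.format(**species))
    # Going the other way -- recovering the fields from the sentence --
    # means guessing the template, and it breaks as soon as an editor
    # rewords the article or the bot changes its format strings.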

It would have been better to work on merging databases or exporting them into
a structured site, something like Freebase.

~~~
baddox
Sometimes I appreciate what you call "info trash." For example, I assume there
is a bot that turns census data into articles for every incorporated community
in the US, like this:
[http://en.wikipedia.org/wiki/Agency,_Missouri](http://en.wikipedia.org/wiki/Agency,_Missouri).

I still think the article is useful as-is, with just the map, data sheet, and
demographics, and of course many incorporated communities have additional
human-composed information added.

I could imagine some more structured data source, where the main article
redirects to a table and scrolls to the correct spot. I would be fine with
that, but as far as I know that concept doesn't exist on Wikipedia.

~~~
koralatov
Looking at the history of that page,[0] it appears a couple of different bots
have worked on it, with human intervention. (I suspect ``Ram-Man'' is an
earlier version of ``Rambot'', but I could be wrong.)

I've read pages like that before, and it never once occurred to me that they
were anything other than the result of sheer human bloody-mindedness. They're
not `exciting', but they're very clearly written in an easily parseable way
that doesn't scream ``machine-generated'' to me. If this is indicative, the
quality of output of these bots is excellent, and a good use of automation ---
let the bots fill out the dry factual stuff, and the humans write the less
tangible, non-statistical stuff.

[0]:
[http://en.wikipedia.org/w/index.php?title=Agency,_Missouri&a...](http://en.wikipedia.org/w/index.php?title=Agency,_Missouri&action=history)

~~~
MatmaRex
Ram-Man is a human account. The same person operates the Rambot bot account.
You can click on their usernames on the history page to see their user pages,
which usually describe these things.

------
dangayle
I would rather have a bot-written stub than nothing at all when I search
Wikipedia. As mentioned in the story, some subjects are underrepresented
while others are oversaturated.

~~~
_delirium
If Wikipedia were the only website I'd agree, but I don't find it useful for
Wikipedia to have stubs that are simple copies of other freely available
sources (especially more authoritative ones), without _some_ kind of synthesis
or value-add. In the case of species, for example, I think there is little
value in a Wikipedia stub that is just a reformatted copy of the ITIS entry
([http://www.itis.gov/](http://www.itis.gov/)). If that's what I wanted, I'd
just go to ITIS. When I see a Wikipedia result in a Google search result I
typically expect it to be the basic taxonomic information one would find in
ITIS _plus_ something more. Otherwise it feels like some of that autogenerated
SEO-style spam, which Google should penalize.

It also doesn't necessarily help jumpstart future improvements if the
structure doesn't align with the granularity that makes sense for an
encyclopedia article. For example, if there is a genus with three species,
each of which is very similar and has very little distinctive written about
it, the normal organization would be to write one article on the genus, with
a short discussion of each species in the main article, not broken out into
three separate, duplicative, and near-empty articles.
separate articles on each species if there's enough to write about them that
covering them as a group becomes unwieldy (this varies widely by species). If
I were to hazard a prediction, it's that the English Wikipedia will as a
result tend towards better organized species coverage than the Swedish
Wikipedia, which will never get around to reorganizing these articles.

~~~
baddox
> If Wikipedia were the only website I'd agree, but I don't find it useful for
> Wikipedia to have stubs that are simple copies of other freely available
> sources (especially more authoritative ones), without some kind of synthesis
> or value-add.

The value-add is that I _know_ about Wikipedia, but not about whatever more
authoritative botanical site you're mentioning.

~~~
tokai
That is why we search. With the nearly empty Wikipedia page in place, you
might never be driven to find the better source.

~~~
aninhumer
If I'm not driven to find a better source, then presumably I was satisfied
with the information on Wikipedia. Quick reference for the most relevant
information is what Wikipedia is good at.

------
worldsayshi
The harm in using bots this way, I suppose, is that correcting potential
misinformation would be much more work-intensive than adding the information
in the first place was.

It seems to me that the Wikipedia model works because/when corrections are at
most as costly as introducing the error.

------
mlinksva
I think this is a good thing, but ideally no bot would be required: if
there's no article in your language, show all the facts from Wikidata, e.g.
[http://tools.wmflabs.org/reasonator/?q=Q1339](http://tools.wmflabs.org/reasonator/?q=Q1339)

~~~
koshatnik
There's a proposal to create 'virtual' articles from Wikidata:
[http://meta.wikimedia.org/wiki/Wikidata/Notes/Article_genera...](http://meta.wikimedia.org/wiki/Wikidata/Notes/Article_generation)

The benefit of this is that articles would be dynamically synthesised from
the latest data when a user requests them, not actually created and added to
the wiki. This would prevent the creation of a potentially infinite number of
articles on subjects not significant enough to merit a write-up by a human
author, so it could be a way to combat the 'bot inflation' of article
numbers.

I don't think anyone's started work on it yet, but if someone fancies it...
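
As a rough illustration, here's a minimal sketch in Python, assuming only
Wikidata's public Special:EntityData endpoint; a real implementation would
live inside MediaWiki and render a full article template, and
`virtual_article` is just an illustrative name:

    # Minimal sketch of a 'virtual' article rendered on request from
    # Wikidata. Nothing is written back to the wiki, so the output can
    # never fall out of sync with the underlying data.
    import json
    from urllib.request import urlopen

    def virtual_article(qid, lang="sv"):
        url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
        with urlopen(url) as resp:
            entity = json.load(resp)["entities"][qid]
        label = entity["labels"].get(lang, entity["labels"]["en"])["value"]
        desc = entity["descriptions"].get(lang, {}).get("value", "")
        return f"{label}: {desc}" if desc else label

    print(virtual_article("Q1339"))  # the Reasonator example linked above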

------
Vik1ng
The only thing that pisses me off is that when I contribute, I'm expected to
write half a page in addition to such an infobox; otherwise a mod comes along
and moves it into my sandbox.

That's the main reason I always stopped editing Wikipedia right away whenever
I tried.

~~~
stanzheng
Exactly, agreed: this is why fringe projects are created for more
opinionated, subjective, or cultural matters.

If you love contributing to Wikipedia-style knowledge, @Localwiki is often
the anti-Wikipedia: it's about locally relevant knowledge, and colloquial
prose is usually OK as long as it's factual and not malicious.

[http://localwiki.org/](http://localwiki.org/)

Increasing the barrier to entry generally never helps inclusion, and it isn't
newbie-friendly.

------
wingi
Yes - he should integrate all the external data into Wikidata and add
articles referencing the Wikidata entries.

And if the external botanical data is updated, he should update the Wikidata
entries ... and not only the less-structured articles.

------
TomGullen
So what are the arguments against what he is doing?

Seems fine to me, and sounds like it's likely to be adding value.

~~~
cjensen
The argument against would be that 2.7M articles containing no useful content
is not helpful. (Assuming, which I do, that the articles do not contain useful
content).

~~~
callesgg
They contain lots of useful information.

Here is an article from an earlier comment,
[https://sv.wikipedia.org/wiki/Eutriana_repens](https://sv.wikipedia.org/wiki/Eutriana_repens)

After reading that, I know that Bouteloua repens is a form of grass, who
first identified it, who gave it its current name, and some other common
names for the plant.

I think it is amazing. If someone has already filled in the information
somewhere, it just seems like a waste of man-hours to retype it into
Wikipedia.

------
drivingmenuts
Quote: His ability to document relatively obscure facts helps him combat one
of the biggest problems he sees in the Wikipedia community. Many entries, he
argues, are made by white male "nerds."

Quote: "It saddens me that some don't think of Lsjbot as a worthy author," he
said. "I am a person; I am the one who created the bot. Without my work, all
these articles would never have existed."

And just how is he, a white male nerd, combatting the problem?

------
alkonaut
Some of these are borderline useless, but I can definitely see the point in
bots pulling structured data into templates IF the data is rich enough to
gain from being in a human-readable format OR if the topic is noteworthy
enough to warrant a stub, that is, if it is likely to become a hand-written
article in the future.

------
hugh4life
I used to like watching the article counts on this page:
[http://meta.wikimedia.org/wiki/List_of_Wikipedias](http://meta.wikimedia.org/wiki/List_of_Wikipedias)

But then the Dutch started using bots to inflate their article count... which
was OK for a year or two, but then other Wikipedias started doing the same
thing. Now I watch the number of edits instead.

I don't care that bots are used, but article counts are now completely
useless for comparing Wikipedias, and it didn't use to be that way. Sure,
there were better measurements, but article counts were still pretty good.

[http://meta.wikimedia.org/w/index.php?title=List_of_Wikipedi...](http://meta.wikimedia.org/w/index.php?title=List_of_Wikipedias/Table&oldid=3055022)

~~~
Crito

        Wiki "size" = [Number of Articles] - [Number of Stub Articles]
    

Is a human-written stub article about minor characters from some fantasy book
more valuable than a robot-written stub article about a type of grass? Is
either article worth considering when trying to create some sort of metric for
"wiki health"?

Is this metric actually useful for anything, other than as a curiosity? If I
were multilingual and trying to decide which wiki to use, I'd go with
whichever wiki was in my primary language. If I did not find the information
I was after, I would check the other wikis for the same article (nicely
listed on the left portion of the screen).

------
staticelf
I think this is going to be more common in the future. Not only for Wikipedia,
but for every page that serves some kind of content.

Perhaps even sites like 9gag or similar could start out with some computer
generated memes ;)

~~~
stanzheng
Ha! Yeah, I thought about a hack that just gamifies random scraped words with
memes to see what rises to the top and what humans are receptive to.

But still, it's two different horses: Wikipedia is supposed to be a central
repository of community-created knowledge. The veracity of the information
and the expertise of the authors are what secured it as a credible, popular
source.

Would we say the same thing if it had been a central repository of just stubs
that were computer-generated from the get-go?

------
zwieback
I remember that on the original mother of all Wikis there was a vigorous
discussion about walled gardens, and a link on the front page warning about
them:
[http://c2.com/cgi/wiki?WalledGardens](http://c2.com/cgi/wiki?WalledGardens)

If all those botanical pages aren't linked from any other Wikipedia pages, it
seems like a single link to a "Botanipedia" would serve just as well.

------
grecy
Is the source of his bot available?

I know there are tons of wiki-bots; I'm very interested in the first half of
his code, the scraping piece.

------
kumarski
This is awesome.

I wonder if he would consider adding some element of readability algorithms
to it.

------
grizzles
Why doesn't this guy just package up the database and autogenerate the page
if the page is blank? That would have more utility than mixing human- and
bot-generated articles.

~~~
shutupalready
Unless there's some nuance that I'm missing, it sounds like that is exactly
what he's doing.

~~~
xorcist
I don't think that's what grizzles is saying; I think the idea was to patch
the wiki software to generate these article stubs on the fly from the actual
source, instead of batch-importing them once.

Sounds like a good idea to me, but I don't know the Wikipedia culture so they
might have reasons against that.
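
A minimal sketch of that fallback, assuming only the standard MediaWiki
action API (`generate_stub` is a hypothetical hook into whatever source
database the bot draws from):

    # Serve a generated stub only when no real page exists, instead of
    # batch-importing millions of stubs up front.
    import json
    from urllib.parse import quote
    from urllib.request import urlopen

    API = "https://sv.wikipedia.org/w/api.php"

    def page_exists(title):
        url = f"{API}?action=query&titles={quote(title)}&format=json"
        with urlopen(url) as resp:
            pages = json.load(resp)["query"]["pages"]
        # The API flags missing pages with a "missing" key.
        return all("missing" not in page for page in pages.values())

    def render(title, generate_stub):
        if page_exists(title):
            return None  # let the wiki serve the human-edited article
        return generate_stub(title)  # synthesised fresh from the database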

------
shutupalready
Why is he doing this in Swedish and two Philippine languages? He apparently
speaks English, so I assume he could adapt his code to do English entries as
well.

~~~
eru
I guess he only has a finite amount of time and energy. Converting the bot
might be the easiest part, but then, you also need to convince the English
'pedians.

~~~
stanzheng
Also, I would imagine many of the articles in English may already have been
seeded, since in general Swedish and Tagalog are more esoteric.

------
reality_czech
This Swedish guy needs to cut it out. Wikipedia doesn't need more articles
that are just "BORK BORK BORK"!

