
The second largest version of Wikipedia is written mostly by one bot - jxub
https://www.vice.com/en_us/article/4agamm/the-worlds-second-largest-wikipedia-is-written-almost-entirely-by-one-bot
======
4cao
This endeavor looks largely orthogonal to what the objectives of an online
encyclopedia should be. Creating as many stub articles as possible and filling
them with "formulaic, generic, and reusable templated sentences with spots for
specific information" seems more like a recipe for an automated content farm
than for "disseminating the sum of _human_ knowledge."

It would be most interesting to know what the 148 active Cebuano Wikipedia
users think of the 5,331,028 articles the bot created, ostensibly for them.
Too bad nobody apparently cared to ask.

In particular, since Cebuano speakers are likely to be fluent in Tagalog
and/or English as well, they can easily use one of the other Wikipedia
editions too. Without the hyperactive bot, the much smaller Cebuano Wikipedia
would arguably be more relevant, reflecting topics truly of interest to the
community.

While the number of articles is a convenient way of comparing Wikipedia
language editions, it only works as such to the extent that the articles are
kept to a certain standard. It seems to me that what we are observing here is
yet another example of the situation that when a measure becomes a target it
ceases to be a good measure.

~~~
rcthompson
The counterpoint is that automatically-created stub articles serve to
encourage community editing. It's much easier to edit an existing article than
create a new one from scratch. This is one of the key principles behind the
Gene Wiki project[1], which creates stub articles for human genes for this
reason:

> Basic articles (called “stubs”) were systematically created based on content
> extracted from structured databases. These stubs are then edited by the
> broader Wikipedia community, while “bots” keep the structured content in
> sync with the source databases.

(The "structured content" mentioned is the info box on the right-hand side of
a gene article. Nowadays I believe this is populated directly from
Wikidata[2].)

Note: I am a member of the lab that runs Gene Wiki, but my work is unrelated.

---

[1]:
[https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5944608/](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5944608/)

[2]:
[https://www.ncbi.nlm.nih.gov/pubmed/26989148](https://www.ncbi.nlm.nih.gov/pubmed/26989148)

~~~
_-___________-_
This seems fine as long as the articles are clearly marked as machine-
generated. Machine translation regularly garbles the meaning of text, while
producing readable text that has correct sentence structure etc. This is a
major problem in an encyclopedia.

~~~
yorwba
The subtitle of the article makes it sound like the text in question is
machine-translated, but it is created by filling a template with structured
data. So long as the template is correct and the data source is accurate, the
meaning won't be garbled.
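
To make the distinction from machine translation concrete, here is a minimal sketch of template filling. The template wording, the field names, and the example record are all hypothetical; this is not the bot's actual code or its Cebuano templates.

```python
from string import Template

# Hypothetical sentence template; the bot's real wording and fields
# are not shown in this thread.
STUB_TEMPLATE = Template(
    "$name is a $feature in $region, $country. "
    "It lies $elevation_m metres above sea level."
)


def render_stub(record: dict) -> str:
    """Fill the template from one structured record.

    substitute() raises KeyError if a field is missing, so an
    incomplete record never silently yields a half-filled sentence.
    """
    return STUB_TEMPLATE.substitute(record)


# Illustrative record with made-up values (not taken from any database).
example = {
    "name": "Example Park",
    "feature": "park",
    "region": "Example County",
    "country": "Exampleland",
    "elevation_m": 340,
}

print(render_stub(example))
```

Because the sentence structure is fixed by a human-written template and only the slots vary, any error would come from the template or the data, not from a translation step.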

~~~
_-___________-_
I can't really see how the number of translated articles could be as huge as
it is using only that approach.

~~~
yorwba
Why not? I tried a random article and got one about a park:
[https://ceb.wikipedia.org/wiki/Atokad_Park](https://ceb.wikipedia.org/wiki/Atokad_Park)
Note that there's no English article about that park that could have served as
the source for a machine translation. The article mostly lists a bunch of
facts about the location that are easily available in public databases. I
don't know how many entries GeoNames.org has, but this park has the number
5063315, so there should be material for quite a lot of articles.

------
sings
I always thought it was a bit bizarre that different language editions of
Wikipedia contain different information. It seems the focus should be more on
translation than content creation. Maybe that isn’t practical with the current
structure, but surely the aim should be a definitive knowledge graph rather
than a disparate and unevenly duplicated set of articles. Just my two cents –
I am sure many have put a lot of thought into how to best tackle this.

~~~
StavrosK
How do you mean? I'm fine with the fact that the Greek Wikipedia doesn't
contain an article about the Boston Tea Party, but I like that it contains an
article about the 1821 rebellion. Requiring the information to be the same
across languages would mean that either both should be translated, or, if no
translator can be found, one should be deleted.

EDIT: Or do you mean contain the same information between different languages
of a specific article?

~~~
charwalker
I think they mean 2 articles in 2 languages with the same content, or as close
as a translator can get. Very difficult to keep updated without automation but
that seems like something they want to steer away from until no longer reliant
on machine translations.

------
peterburkimsher
I discovered this in 2018, when comparing lists of languages supported by
different software and the number of speakers.

[https://peterburk.github.io/i2018n/#wikipedia](https://peterburk.github.io/i2018n/#wikipedia)

Having machine-translated content is powerful for SEO, but I don't know how
practical that is for Cebuano. It would be nice for English to no longer be
practically required for people to become computer literate.

~~~
BiteCode_dev
> It would be nice for English to no longer be practically required for people
> to become computer literate.

French here. We are terrible at English in my country.

Still, the fact that most information in computing is shared in English is a
godsend. Sure, you have to learn it, but then:

- no need to search for it in so many languages

- no need to produce translations of tutorials/docs/comments in so many
languages

- the community to share and communicate with is huge and diverse

- English is way more efficient than French, Spanish, German or Chinese to
talk about technical stuff

~~~
seventh-chord
Genuinely curious about your last point (don't know much about the topic). Is
English intrinsically better at this, or is it because of the presence of
jargon? Is it a studied phenomenon, or is it something most people feel?

~~~
BiteCode_dev
English is usually shorter than other Latin-based languages. It's longer than
ideogram-based ones, but you don't have to learn 100000 symbols to express
yourself in it.

It also has a very simple grammar compared to most languages. Take this
sentence:

"I would like not to go to school today"

The French equivalent would be:

"Je voudrais ne pas aller à l'école aujourd'hui."

"would like" is a simple combination of two words, but in French you need to
know the precise conjugation of it.

"not" is actually expressed as 2 words with "ne pas", which can be positioned
in several ways.

The infinitive, as in "to go", is simple in English: just add "to". In French,
each verb is different, like "aller".

Then you've got "the" in all circumstances in English, but the "l'" could also
be "le", "la", or "les" depending on the word after it. Also remember that each
noun is either feminine or masculine in French, even a stone or the sun.

Then "à" and "école" have an accent. French has many of them; you need to know
the right one, where to place it, how to pronounce it, and how to type it on
the keyboard.

Finally, "today" vs "aujourd'hui". I know which one is easier to type in a bug
report.

Not to say English doesn't have weird traps, but it's very, very relaxing
compared to the rest. And much more efficient.

Also, describing a view of the countryside with it feels a bit limiting. But
I'm not Shakespeare :)

~~~
andrewzah
And then you have Korean.

오늘 학교에 가기 싫다 (I don’t want to go to school today)

Hangeul is an alphabet written in syllable blocks, so anyone can learn to read
it, just like Latin script. It was specifically designed in response to
peasants not being able to read Chinese characters (hanja).

Things like the subject and topic can be, and are, omitted when superfluous.
(The above sentence omits the subject and subject particle. There is no topic
here.) English almost always needs to specify the subject, aside from very
casual speech/slang. Korean, like English, doesn’t need to specify gender of
nouns, and it also doesn’t need “a” or “the” markers. The location particle
above could be dropped too.

However Korean does have complex honorifics and formality conjugations, which
typically get longer the more formal/polite it is. Above we have the plain or
dictionary form, which is usually the shortest form as well.

~~~
BiteCode_dev
How would you express "would like"?

"I don’t want to go" and "I would like not to go" express different things.

One is expressing opposition right now, the other one is expressing desire or
even a request, potentially while the action is already engaged. The first one
is definitive, the second one is wishful thinking or negotiation.

~~~
andrewzah
I don’t see any practical difference other than using more polite, indirect
language. The effect of “would like not to” is the same in the end with
opposition.

I don’t think there’s a 1:1 translation of “would like not to”. I’d probably
say something like “it’d be good if I did / didn’t do X”. Which is less direct
than the equivalent “I do / don’t want to do X”.

------
tomrod
I like this because the growth and progress of a knowledge base, regardless of
language or hosting platform, is incremental and cumulative. Wikipedia shows
this effectively in the English edition because it happened so quickly. But
even the legacy encyclopedias did this over centuries. Whether a bot lays
the groundwork from other reference points or dedicated humans do it is sort
of immaterial, I think, because in the very long run this benefits the people
who speak this language.

In an age where languages are dying with their last speakers, Visayan has done
much to preserve their diversity -- although not a written/codified language,
volunteers give radio broadcasts in the language, books are published in it
(here the lack of codification shows by variance in spelling, verb
conjugation, and sentence structure), and similar. Thank you to this
wikipedian for doing something to preserve a wonderful language (I mention in
another comment I am fluent and miss the regular speaking of it).

------
tangoalpha
Clicking on random articles at
[https://ceb.m.wikipedia.org/wiki/Espesyal:Random#/random](https://ceb.m.wikipedia.org/wiki/Espesyal:Random#/random),
it looks like every article is about either a tree, or an animal, or an
insect, or a place...

~~~
acqq
And looking at one whole initial article it generated:

[https://ceb.wikipedia.org/w/index.php?title=Klakkabekken_(su...](https://ceb.wikipedia.org/w/index.php?title=Klakkabekken_\(suba_sa_Noruwega,_Odda\)&oldid=14081773)

it describes where the place is and gives the citation as "found in the
Geonames.org database".

------
qwerty456127
So they mean to tell us "insignificant" facts and articles must be deleted?

~~~
brodo
The German Wikipedia would be twice as big if mods weren’t obsessed with some
made-up criteria of relevance.

~~~
FalconSensei
That's sad. In the end, all this would (if it hasn't already) make them just
go for the English version. I already do this (I'm Brazilian), as the
Portuguese version is nowhere near the international (English) version in
terms of completeness and being up-to-date.

BTW, do you have a link for their terms on "relevance"?

~~~
Polylactic_acid
The English version isn't particularly free. I attempted to add a page about a
file format that is fairly well used but doesn't have a huge amount of
information online about it. The only real source is a zip file from a
company's website which contains a PDF with the file spec and some example
programs. Unfortunately the editors decided that due to the lack of
referenceable sources, they would rather no article exist at all.

~~~
qwerty456127
This bullshit policy drives me mad. I will start donating regularly once it's
cancelled. Not sooner, nor later.

~~~
Polylactic_acid
I understand it for some cases where the mods just need to stop people making
up random crap on topics that don't exist or can't be verified. But in this
case a single reference is more than enough to write the whole page because
the spec is literally the only source of truth on the topic.

Unfortunately I think the mods may be so passionate about "protecting the
integrity of Wikipedia" that they let legitimate content be deleted. It also
doesn't help that the Wikipedia UI for disputes and edits is really confusing,
and I had a hard time trying to work out what was going on or how to
communicate with this moderator. The whole system is designed for power users
only.

~~~
scandinavegan
It's important to note that there are not only "the mods", but two opposing
factions in Wikipedia: the Deletionists and the Inclusionists. [1]

I too agree that we don't need articles on someone's cat, but I've had
articles deleted as not notable on indie web comics and indie role-playing
games with hundreds or thousands of readers or copies sold.

I thought that the fact that the RPG was published and publicly available, and
was being discussed in RPG forums would make it notable enough, especially
when it was mentioned as an inspiration for rules in more traditional RPGs.
But since they hadn't been mentioned in any published articles they were
deleted and there was no real way for me to fight it. I had added stuff like
the list of contributors, publishing year, and overview of the rules and
setting, with no personal discussion of the game.

The result was that I stopped trying to improve Wikipedia, because I don't
have the time or interest to fight people with an infinite amount of time who
delete my additions. My main contribution to Wikipedia wouldn't have been on
the articles on Barack Obama or World War II anyway, as they already have
people who are experts that add information. I could have brought information
on my specialized topics of interest, but realized that they would all seem
non-notable to someone who's not interested in the same thing and would be
deleted.

[1]
[https://en.wikipedia.org/wiki/Deletionism_and_inclusionism_i...](https://en.wikipedia.org/wiki/Deletionism_and_inclusionism_in_Wikipedia)

------
brokensegue
Slightly pedantic but the largest "Wikipedia" (depending on how you define it)
is [http://wikidata.org/](http://wikidata.org/) and it's also primarily
written by bots.

~~~
playpause
That’s a wiki, not a Wikipedia.

~~~
brokensegue
But both are Wikimedia projects.

