In Sweden, Sverker Johansson and His 'Bot' Have Created 2.7M Wikipedia Articles (wsj.com)
134 points by ilamont on July 14, 2014 | 76 comments

I'm not very in tune with Wikipedia's culture (which, I've read, is very nuanced and rigid[1]), but I really don't see why this is a bad thing, given that the information in the articles is accurate (and the article gives the impression that glitches are rare).

If nobody else was going to create an article about some species of butterfly, I don't see why adding that information would be harmful to Wikipedia. Does it make Wikipedia harder to read? Harder to search?

I don't think "it's not written by a human" is a valid argument for factual information, and I've never seen any evidence to suggest that it should be one.

EDIT: I found this bot's edit log! https://sv.wikipedia.org/w/index.php?title=Special:Logg/Lsjb...

Here are a few articles randomly picked out of the latest 1000:





After looking at these, I'm beginning to see why there is some backlash. There are literally thousands of articles here that read "X is a species of grass. It got its name from Y and is described in Z catalog." The only people who would need this information are botanists, and they already have their own specialized sources. I'm still not against bot-produced content, but I understand why some people oppose initiatives like this.

[1] http://www.gwern.net/In%20Defense%20Of%20Inclusionism

At least part of the problem is that he's generating what one might call 'info trash': he's taking highly structured information from databases, and turning it into natural-language prose, a data source of less value since it's less structured.

These prose versions are now going to steadily fall out of sync with the original databases, be much more prominent in Wikipedia and Google, diverge from each other, and be harder to parse and perform any complex analysis on. (A database is at least relatively comprehensible, but to parse his dumps you have to hope that you can reverse-engineer the format, that no other bots or editors have modified it much, and that he didn't get clever with his format strings.) If at some point one wanted to change something about the presentation, it would no longer be a matter of editing one template and having the user-friendly HTML view onto the database update automatically for all viewers; instead, one has to run a carefully written bot on millions of articles (and since that is beyond semi-automated bots, you have to have special permission to run it).

It would have been better to work on merging databases or exporting them into a structured site, something like Freebase.
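To make the template-versus-generated-prose point concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record layout, the two templates, and the `parse_back` helper are hypothetical and are not how the bot actually works. It shows that a presentation change is a one-line template edit when pages are rendered from the structured record, while recovering data from frozen prose depends on a brittle pattern that fails as soon as a human rewords the sentence.

```python
import re
from string import Template

# Hypothetical structured record, standing in for one row of a species database.
RECORDS = [
    {"name": "Brachiaria plantaginea", "kind": "grass", "author": "Heinrich Friedrich Link"},
]

# Two hypothetical presentation templates. Switching between them is a one-line change.
TEMPLATE_V1 = Template("$name is a species of $kind first described by $author.")
TEMPLATE_V2 = Template("$name ($kind) was first described by $author.")


def render_all(records, template):
    """Render every record through one template."""
    return [template.substitute(record) for record in records]


# "Frozen" prose: what a bot writes into articles once, using the first template.
frozen = render_all(RECORDS, TEMPLATE_V1)

# Rendering on demand: a later presentation change applies to every record at once.
live = render_all(RECORDS, TEMPLATE_V2)

# Getting data back out of frozen prose means reverse-engineering the format string.
PROSE_V1 = re.compile(
    r"^(?P<name>.+) is a species of (?P<kind>\w+) first described by (?P<author>.+)\.$"
)


def parse_back(sentence):
    """Recover the record from V1 prose, or None if the text no longer matches."""
    match = PROSE_V1.match(sentence)
    return match.groupdict() if match else None


# A small human rewording is enough to defeat the pattern.
hand_edited = "Brachiaria plantaginea is a grass, first described by Heinrich Friedrich Link."

print(parse_back(frozen[0]))
print(parse_back(hand_edited))
```

The round trip only works while nobody touches the generated sentence, which is the maintenance risk described above.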

Sometimes I appreciate what you call "info trash." For example, I assume there is a bot that turns census data into articles for every incorporated community in the US, like this: http://en.wikipedia.org/wiki/Agency,_Missouri.

I still think the article is useful as is, with just the map, data sheet, and demographics, and of course many incorporations have additional human-composed information added.

I could imagine some more structured data source, where the main article redirects to a table and scrolls to the correct spot. I would be fine with that, but as far as I know that concept doesn't exist on Wikipedia.

> I could imagine some more structured data source, where the main article redirects to a table and scrolls to the correct spot. I would be fine with that, but as far as I know that concept doesn't exist on Wikipedia.

The structured data source now exists, but how to present its data is still being worked out. You can add information to it now, since the goal is to collect a bunch of structured information and then incrementally figure out how to display it, either on its own or integrated into Wikipedia articles (bot additions here are also very welcome): https://www.wikidata.org

I believe there's going to be some offloading of some structured Wikipedia information so it pulls from Wikidata in the future, instead of being maintained "manually" in articles. For example the geotags that are currently buried in Wikipedia articles' markup will probably be centralized to Wikidata soonish and just pulled from there to display. And infoboxes may be auto-populated from the Wikidata information as well. Sending people to auto-generated stub articles when a "real" article is missing is an interesting idea that might happen longer-term.

Looking at the history of that page,[0] it appears a couple of different bots have worked on it, with human intervention. (I suspect ``Ram-Man'' is an earlier version of ``Rambot'', but I could be wrong.)

I've read pages like that before, and it never once occurred to me that they were anything other than the result of sheer human bloody-mindedness. They're not `exciting', but they're very clearly written in an easily parseable way that doesn't scream ``machine-generated'' to me. If this is indicative, the quality of output of these bots is excellent, and a good use of automation --- let the bots fill out the dry factual stuff, and the humans write the less tangible, non-statistical stuff.

[0]: http://en.wikipedia.org/w/index.php?title=Agency,_Missouri&a...

Ram-Man is a human account. The same person operates the Rambot bot account. You can click on their usernames on the history page to see their user pages, which usually describe these things.

But now you are arguing the "it would be better if you did Y instead of X, so therefore stop doing X" fallacy. It's not like he is pillaging the original databases and leaves them burning. They are still there and his copying of data from them doesn't make them worse. You or anyone else are welcome to spend your free time exporting and merging the databases into Freebase.

> But now you are arguing the "it would be better if you did Y instead of X, so therefore stop doing X" fallacy.

That's not a fallacy, that's just good advice. If this Swedish dude wants praise, he should spend his time doing things which are genuinely good, not dubious and possibly net-negative in the long run.

Some people (a fairly significant number) find it much easier to parse information presented in English sentences as opposed to the presentation forms typical in structured data (often table form, with some kind of filtering).

In many cases, it is very rare for facts like those presented in botanical databases to change: it often means a plant has been recategorized, which is a non-trivial thing to do. It is entirely appropriate for this to be handled manually, given how rare it is.

Your arguments about it being better to work on a structured version present a false dichotomy: it isn't Wikipedia OR a structured version, it is Wikipedia AND structured data that need to be improved.

>It would have been better to work on merging databases or exporting them into a structured site, something like Freebase.

Why is this not appropriate for Wikidata?

I was trying to think of Wikidata's name but a search didn't bring up what I was looking for, so I simply used the database site whose name I could recall (because it's always amused me).

Except that no one uses Freebase.

I think the problem is that the entries are very low-quality and are being produced in such great quantities that it will be hard for anyone to turn them into passable articles.

The purpose of Wikipedia is not to be a collection of all the factual information it can gather. You or I might wish for it to be one, but that isn't what its creators mean for it to be. Each individual article is expected to meet Wikipedia's guidelines for relevance and to have some minimal level of quality. If it isn't possible to write a good, encyclopedic article on a topic, Wikipedia's stance is generally that it should be deleted (or, if the article is just overly specific and the information is relevant to a broader topic, the article might be incorporated into a section in a more general article).

I think these stubs have the benefit of reducing the barrier to entry for future contributors. If I search for something and find a stub, I can easily throw in even one sentence, and the article is incrementally improved. Whereas if there's no article, I am much less inclined to make a new one.

Exactly: creating a new Wikipedia page is pretty intimidating because most people do not understand WP's template system or how to make an infobox.

> The only people who would need this information are botanists

I think that looking up plant species is exactly the kind of thing people would want to use an encyclopedia for.

These are (almost) one-sentence articles. They're not appropriate for an encyclopaedia. If anything, they should be in a 'list' article format.

If you want this kind of one-sentence descriptor, it would be better served in a specialist publication.

Why isn't it appropriate for an encyclopaedia?

For example, the 1910 Encyclopaedia Britannica's complete entry for "denim" is "(an abbreviation of the serge de Nimes), the name originally given to a kind of serge. It is now applied to a stout twilled cloth made in various colours, usually of cotton, and used for overalls, &c."

The entry for "Gimli" is "In Scandinavian mythology, the great hall of heaven whither the righteous will go to spend eternity."

It's not hard to find more examples. But I don't think people considered the EB less of an encyclopaedia for its use of one-sentence descriptors.

The heart of the matter is that there's precious little difference between a dictionary and an encyclopedia. Indeed, the EB's full name is "The Encyclopædia Britannica: a Dictionary of Arts, Sciences, Literature and General Information"

To double check that it's not limited to the EB, I looked in Harmsworth's Universal encyclopedia. The entry for "fulcrum" is "(Lat. fulcrum, a prop) Fixed point in the mechanical system of a lever about which the lever can rotate. See Lever." The entry for "gumboil" is "Small abscess on the gum arising in most cases from decay at the root of a tooth."

See http://menvall.wordpress.com/2010/09/14/on-wikipedias-attemp... for an analysis of the distinction between the two, and the conclusion that "everything that is included in a dictionary also can be included in an encyclopedia, whereas all that is included in an encyclopedia either can or can’t be included in a dictionary. This relation is, however, completely misunderstood by some editors of Wikipedia."

(Had you written that it wasn't appropriate for Wikipedia, then that's a different issue. I speak now only of the broad category of "encyclopedia".)

Triple-checking, the entry for "gumboil" in Wikipedia (at http://en.wikipedia.org/wiki/Gumboil) redirects to "Intraoral dental sinus". The complete entry is two sentences long:

> Intraoral dental sinus (also termed a parulis and commonly, a gumboil) is an oral lesion characterized by a soft erythematous papule (red spot) that develops on the alveolar process in association with a non-vital tooth and accompanying dental abscess.[1] A parulis is made up of inflamed granulation tissue.

By your definition, this "(almost) one-sentence" article should be removed from WP, no?

Firstly, just because you can find brief entries in other encyclopaedias doesn't mean that it's good form.

Secondly, I didn't demand complete excision with the kind of frothy fervor you're implying. I said a better format for these "X is type Y, discovered by Z, listed in Q" entries is collating them all in a list format. WP has list articles like this aplenty - dense, easily digestible information on similar topics, allowing quick and easy comparison and scanning.

As for your link, one of the bold highlights is "explains subjects in greater detail than a dictionary". Another of the three definitions of 'encyclopaedia' your link provides says "with data on and discussion of each subject identified" (my emphasis). So that's two out of three definitions that quite strongly indicate non-brief articles - your linked article is wrong from its own source material, and hasn't made the case that dictionary-like brevity is suitable for an encyclopaedia.

> By your definition, this "(almost) one-sentence" article should be removed from WP, no?

What, are you trying to 'catch me out' here? Do you think that's a good quality article? It's a stub, it's not what WP wants to encourage, and it's more like a dictionary definition than either "explaining a subject in greater detail" or "discussion of the subject". Yes, I think it's a bad article for any encyclopaedia - it's quite brief, and full of technical jargon. If you didn't already know the specific jargon, it's completely useless as a "general course of instruction" (the etymology argument from your link). And if you do know the jargon, you have a pretty good chance of working it out from the name alone; the article merely confirms the topic if you're unsure, but you don't get any more insight into it.

As 'trick questions' go, this one sucked.

Trick question? I'm showing that my question - "Why isn't it appropriate for an encyclopaedia?" - is meaningful, by giving counter-examples from three encyclopedias. This suggests that your definition is not aligned with how the term is used in practice.

I ask that you clarify your reasoning.

You say my linked-to reference "hasn't made the case that dictionary-like brevity is suitable for an encyclopaedia". The link isn't trying to make that delineation between the two. It's arguing (and I agree) that a dictionary is a type of encyclopedia, not that they are two different things. You mentioned some quotes, in bold. The author later comments on those exact same quotes (with bold translated to italics):

> These definitions show that whereas dictionary is defined by words alone: “reference work that lists words, usually in alphabetical order, and gives their meanings and often other information such as pronunciations, etymologies, and variant spellings“, encyclopedia is defined either as synonymous to dictionary: “the term is often interchanged with the word “dictionary,” as in the present work” or by a larger extension than dictionary: “explains subjects in greater detail than a dictionary”. There is thus no conflict between dictionary and encyclopedia. They are either synonymous or only have different extensions (i.e., encyclopedia including dictionary, but covering a larger set of phenomena).

I checked with the OED, at http://www.oed.com/viewdictionaryentry/Entry/52325 . It concurs, since its definition 1b. for dictionary is (italics mine):

> In extended use: a book of information or reference on any subject in which the entries are arranged alphabetically; an alphabetical encyclopedia

Yes, I'm saying that the article for "gumboil" in WP is not a stub, does not need to be longer than it is, and is very much like what WP should support. While I agree with you that the older print definition of the term is easier to understand than what WP has, that's at most one more line, and more likely solved by rewriting.

BTW, I also looked up Gimlé in WP. That's three sentences long, so a full two sentences longer than the 'Gimli' entry in Encyclopaedia Britannica.

Why must everything require more than a few lines to fit into your concept of an encyclopedia? Certainly Gimlé doesn't fit in a dictionary, so where else would it go?

This idea that dictionaries and encyclopaedias are separable things is entirely within your head - it's an argument you're projecting onto me. I haven't mentioned the word 'dictionary' at all, with the exception of one quote from your own source. You're attributing to me an argument that I'm not making - I couldn't care less whether you call an encyclopaedia a 'dictionary', an 'encyclopaedia', or a 'sauerkraut sandwich' here. I haven't said anywhere "That should be in a dictionary"

I'm talking about the function of an encyclopaedia - which your own link has sources generally requiring non-brief articles. Articles which discuss and expand on a subject. Even the etymology provided is 'general education', which implies more than mere definition of a word.

Yes, super-short articles like 'gumboil' or 'Gimle' should be rolled into larger, more comprehensive articles. There is plenty you could add to gumboil - an image to show one, demographic preponderances, common treatments, common complicating factors, all of which enhance the user's knowledge of the topic. It certainly should be reduced or modified in terms of jargon. As for Gimlé, there's no reason why it can't be rolled into a more comprehensive article on Asgard, Norse Mythology, or whatever. Check out the 'Elysium' article for ways you can expand it to make it a more useful article in its own right.

Another thing that you're missing is that WP (and myself) both view these things as undesirable, but not so undesirable that they should be destroyed as a matter of course. They're just bad articles - and contrary to what you're saying, they're far from complete.

In the case of the 'grasses' links of the OP, these are absolutely terrible articles (the irony being that they're chaff - an appropriately grassy reference). Yes, it's information, but it's very poorly laid out and hard to access or compare. It's the absolute barest information - and far, far from "general education" substantiveness. Cool, Brachiaria plantaginea is a grass, but let's have a look at the entry:

Brachiaria plantaginea [1] is a species of grass which was first described by Heinrich Friedrich Link, and got its current name of Albert Spear Hitchcock. Brachiaria plantaginea included in the genus Brachiaria, and the grass family. [2] [3] No subspecies are listed in the Catalogue of Life. (ta, google translate)

There is barely any information here beyond "It's a grass". What kind of grass? Is it grass like crabgrass? Like asparagus? Like bamboo? What are its characteristics? Where do you find it? Is it peculiar to any animal's diets? How does it propagate? What does it even look like? Does it have defense mechanisms? Does it survive arid climates well? Are humans allergic to it at all? Not to mention that it's self-evident in the name Brachiaria plantaginea that it's in the genus Brachiaria.

It's an awful, very low quality article - regardless of whether or not you think such information belongs in an encyclopaedia, the article quality does not. Do you feel generally educated by that article? Do you feel like the thing that is Brachiaria plantaginea has been sufficiently discussed? Is the article self-contained (ha!) and explained in detail? These three questions are fundamental parts of the definitions of 'encyclopaedia' given by your original link (and which I don't particularly contest - I rather agree with them).

I'll repeat my earlier parenthetical comment: "Had you written that it wasn't appropriate for Wikipedia, then that's a different issue. I speak now only of the broad category of "encyclopedia"."

I mentioned dictionaries because I misunderstood you. Thank you for the correction.

"There is plenty you could add to gumboil" is of course true. It's also true for nearly every single deleted item in WP, including those which aren't sufficiently notable. It's also true of nearly every item which is currently rolled into a larger article. (Hence http://en.wikipedia.org/wiki/List_of_recurring_The_Simpsons_... vs. http://simpsons.wikia.com/wiki/Bernice_Hibbert )

This is an eternal debate by WP editors. Well-defined requirements and boundaries are not possible, only rough guidelines for most areas. This is one such area.

I agree that the information about B. plantaginea is weak. This is not atypical of biological entries in WP. Consider http://en.wikipedia.org/wiki/Calabash_tree , which I chose because it's only a few lines longer than the short entry in the 1910 Encyclopedia Britannica. (BTW, the Swedish entry in WP has a picture of the gourd, while the English one does not.)

After 9 years this stub entry still doesn't answer some of your questions, like "Is it peculiar to any animal's diets? How does it propagate? What does it even look like? Does it have defense mechanisms? Does it survive arid climates well? Are humans allergic to it at all?"

Worse is the line "The fruit pulp is used traditionally for respiratory problems." It doesn't say if it's actually effective, and if so, what is the method of treatment. Is it eaten? Smeared on the chest like a mentholated topical cream? Used as a suppository?

Thus your criticisms, while quite valid, should be tempered by context.

As another example, http://en.wikipedia.org/wiki/Hairy_long-nosed_armadillo is also a 4-line stub. If you look in the history you'll see it was once much more informative. This is because it copied text verbatim from http://armadillo-online.org/dasypus.html#pilosus . That source is under the CC by-nc-sa license, while WP does not accept non-commercial-only licenses, so I believe it was rolled back for that reason.

It's relevant that the armadillo page was created by a bot in 2007, in almost exactly the current form. The main issue is likely that WP is a lousy place for species information. Perhaps it's because the primary literature doesn't meet the copyright requirements, and specialists who can create appropriate text are more interested in contributing to specialist compendiums?

For what it's worth, http://www.conabio.gob.mx/malezasdemexico/poaceae/brachiaria... says it's from Florida and Mexico to South America, with secondary distributions in the Old World. Kew gives details in (technical) English at http://www.kew.org/data/grasses-db/www/imp01488.htm .

Neither lists anything about allergies, its defensive mechanisms, etc.

My own belief is that this bot information for B. plantaginea, etc. should be in an infobox of some sort, rather than free text. I feel that if I edit the text to include the information I identified, then a future bot sweep may be unable to handle the changes automatically.

If one-sentence articles are considered a problem, that seems like a reasonable choice for Wikipedia to make, but it would apply to human-written and bot-written articles.

It does apply, regardless of the source of the article. Wikipedia doesn't like stub articles.

Most articles started as stubs. At the moment many language versions of Wikipedia have different opinions about stubs.

E.g., German Wikipedia doesn't allow stubs anymore, and the admins delete and revert more pages every day than new ones are created. It may be a cultural problem, as such admins identify themselves with 'their articles' and don't allow any changes.

The question of course is in how to determine when a short article is a stub. Some short entries are sufficiently complete for the purposes of WP, whose English guidelines say "there are some subjects about which very little can be written."

Consider http://en.wikipedia.org/wiki/Muati

> Muati is an obscure local god in the Sumerian pantheon. He is associated in some texts with the mythical island paradise of Dilmun, and becomes syncretised with Nabu.

That's unlikely to get much longer. For one, the "Dictionary of the Old Testament: Wisdom, Poetry & Writings ..." says "Muati, a god about whom we know very little."

Considering the fact that the rules and culture of Wikipedia seem to want people to write like bots, I don't see any issues with letting a real one do the writing for us!

> The only people who would need this information are botanists, and they already have their own specialized sources.

So, that's a nice benefit for botanists (particularly amateur or student botanists), with only a very minor cost imposed on everyone else (namely, the slight namespace pollution, but that's very unlikely to manifest itself). Sounds alright to me.

Reading the article, I think some of it is more of a "why?" than a "why not?", since the articles are factually correct but are mostly just lists of facts and not really something you couldn't find in any number of other resources.

For people who do serious article writing, I imagine this might be considered as "cheapening" their work. For instance, I imagine some editors also resent the notion that encyclopedia editing is somehow reducible to plugging facts into the right templates. Of course the bot's authors don't really believe they are creating articles as high-quality as good human-edited entries, but the emotional reaction on the part of other editors is at least something I can comprehend.

I'm not really in tune with Wikipedia, so this is mostly conjecture.

edit: reverted edit, added first sentence of 2nd paragraph.

> "and not really something you couldn't find in any number of other resources."

The same should be able to be said about everything on Wikipedia, since Wikipedia is not supposed to have original research and should have a source for everything.

The "Why?" would then be "Because it is better to search for [obscure butterfly] and find a short list of facts than to search for that butterfly and find nothing at all."


A stub also lowers the barrier of entry for new users wanting to add an obscure butterfly they've just tracked down.

> The same should be able to be said about everything on Wikipedia

I don't think so. Well-written encyclopedic entries are in far shorter supply than bare lists of facts.

I guess the fundamental difference of opinion is between those who feel Wikipedia is an encyclopedia, and those who feel it's a dumping ground for human knowledge. Note that I'm not taking sides, just trying to explain the root causes for the difference of opinion.

Also, on a purely technical note, I very much doubt that you couldn't find the information in bot-generated articles anywhere else using a search engine. If that were the case, where are the bots getting the data?

Bots could be concatenating two or three different data sets to create one stub per butterfly.

Or bots could be taking something in a weird set of scans, OCRing that, and then putting it in a stub. This would be troubling unless there was a human checking the quality of the OCR.

There is plenty of stuff that is public domain and not online in a useful form.

The problem is that for any obscure organism you need to know what it is before you can search for it on WP, and WP offers no way to identify it. If the article on the species has an image, chances are good that it shows a related species instead. Then there is the problem of someone coming along and adding mangled 'facts' to the article, or 'facts' derived from 19th-century works.

> ...Reading the article, I think some of it is more of a "why?" than a "why not?", since the articles are factually correct but are mostly just lists of facts and not really something you couldn't find in any number of other resources.

The argument I'd make for "why?" is that Wikipedia is more accessible and more reliably available than most other resources. I mean, if the government of the Philippines had a web-based, up-to-date list of towns with some basic information, it might make sense to offload the effort of maintaining that information to them. As it stands, though, not even the US has such a directory -- so Wikipedia picks up the slack (or at least it does for towns in the US).

I'm normally a big fan of including any and all accurate information on Wikipedia. However, with this many articles, I'd be concerned about the ability of any human editors to actually notice malicious misinformation. If a random person edited one of those articles to change an obscure fact to an incorrect statement, would anyone notice?

That's a problem Wikipedia has in general; improvements to Wikipedia's botany coverage are not going to meaningfully change it one way or another.

I'm completely pro-bots, but it does seem this one writes articles that are way too short, with very little content. Bots need to do a similar-enough job to humans while being fast and reliable.

Not spam the wiki with names and basically no info.

> There are literally thousands of articles here that read "X is a species of grass. It got its name from Y and is described in Z catalog." The only people who would need this information are botanists, and they already have their own specialized sources.

Except that those articles seem to have good infoboxes. Such structured information is very useful for many purposes. For instance, it is used to build ontologies based on the data in DBpedia/Wikidata. These datasets help in constructing better semantic tools (like translators, etc.). So it's still pretty useful.

Looking at [1], the following can be found listed under sources:

^ <![CDATA[Scribn. & Merr.]]>, 1901 In: Bull. Div. Agrostol. U.S.D.A. 24: 26

Maybe he should have done more testing before spamming Wikipedia with mess like that.

[1] https://sv.wikipedia.org/wiki/Eutriana_repens
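For what it's worth, the kind of pre-publication check that would catch this is small. The following is only a sketch under my own assumptions: `strip_cdata` and `has_markup_residue` are hypothetical helper names, and nothing here reflects how the bot is really implemented. It unwraps XML CDATA sections from a source string and flags any text that still contains the delimiters.

```python
import re

# Matches an XML CDATA section and captures the text inside it.
CDATA_SECTION = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def strip_cdata(text):
    """Replace each CDATA section with the plain text it wraps."""
    return CDATA_SECTION.sub(r"\1", text)


def has_markup_residue(text):
    """Return True if CDATA delimiters are still present in the text."""
    return "<![CDATA[" in text or "]]>" in text


# The source line quoted above, exactly as it appears in the article.
raw = "^ <![CDATA[Scribn. & Merr.]]>, 1901 In: Bull. Div. Agrostol. U.S.D.A. 24: 26"

cleaned = strip_cdata(raw)
print(has_markup_residue(raw))      # the raw line should be flagged
print(cleaned)
print(has_markup_residue(cleaned))  # the cleaned line should pass
```

Running a check like `has_markup_residue` over a sample of generated pages before a bulk upload would be one cheap way to spot this class of glitch.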

So where is the deletionist bot to counter his spam?

I would rather have a bot-written stub than nothing at all when I search in Wikipedia. As mentioned in the story, some subjects are underrepresented while others are supersaturated.

If Wikipedia were the only website I'd agree, but I don't find it useful for Wikipedia to have stubs that are simple copies of other freely available sources (especially more authoritative ones), without some kind of synthesis or value-add. In the case of species, for example, I think there is little value in a Wikipedia stub that is just a reformatted copy of the ITIS entry (http://www.itis.gov/). If that's what I wanted, I'd just go to ITIS. When I see a Wikipedia result in a Google search result I typically expect it to be the basic taxonomic information one would find in ITIS plus something more. Otherwise it feels like some of that autogenerated SEO-style spam, which Google should penalize.

It also doesn't always really help in jumpstarting future improvements, if the structure doesn't align with the granularity that makes sense for an encyclopedia article. For example, if there is a genus with three species, each of which is very similar and has very little distinctive written about it, the normal organization would be to write one article on the genus, with a short discussion of each species in the main article, not broken out into three separate duplicative and near-empty articles. You'd only break out into separate articles on each species if there's enough to write about them that covering them as a group becomes unwieldy (this varies widely by species). If I were to hazard a prediction, it's that the English Wikipedia will as a result tend towards better organized species coverage than the Swedish Wikipedia, which will never get around to reorganizing these articles.

> If Wikipedia were the only website I'd agree, but I don't find it useful for Wikipedia to have stubs that are simple copies of other freely available sources (especially more authoritative ones), without some kind of synthesis or value-add.

The value-add is that I know about Wikipedia, but not about whatever more authoritative botanical site you're mentioning.

That is why we search. With the nearly empty wikipedia page in place, you might never be driven to find the better source.

If I'm not driven to find a better source, then presumably I was satisfied with the information on Wikipedia. Quick reference for the most relevant information is what Wikipedia is good at.

> Otherwise it feels like some of that autogenerated SEO-style spam, which Google should penalize.

I assumed Google already recognized these near-duplicates and was penalizing them.

Exactly. If nothing else, those stub articles serve to give a (not complete, yet more accurate) index-able list of topics not yet adequately covered. Like a /* TODO */

That's exactly the point of the stub articles isn't it? They specifically mention that you should improve it by adding relevant info if possible.

The harm in using bots in this way, I suppose, would be that correcting potential misinformation would be much more work intensive than adding the information in the first place.

It seems to me that the Wikipedia model works because/when corrections are at most as costly as introducing the error.

I think this is a good thing, but ideally no bot would be required: if there is no article in your language, show all the facts from Wikidata, e.g. http://tools.wmflabs.org/reasonator/?q=Q1339

There's a proposal to create 'virtual' articles from Wikidata: http://meta.wikimedia.org/wiki/Wikidata/Notes/Article_genera...

The benefit of this is that articles would be dynamically synthesised from the latest data when the user requests them, and not actually created and added to the wiki. This would prevent the creation of a potentially infinite number of articles on subjects not significant enough to merit a write-up by a human author, so could be a way to combat the 'bot inflation' of article numbers.

I don't think anyone's started work on it yet, but if someone fancies it...
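
For anyone wondering what "synthesised on request" could look like, here is a minimal sketch, not the actual proposal. Everything in it is an assumption made for illustration: the two dictionaries are hypothetical stand-ins for the wiki's stored pages and for a structured fact store (they do not mirror the real Wikidata data model), the species and catalog names are invented placeholders, and `render_virtual_article` and `get_article` are made-up helper names.

```python
from typing import Optional

# Hypothetical stand-in for pages that humans actually wrote and saved.
HUMAN_ARTICLES = {
    "Grass": "Grasses are a family of flowering plants.",
}

# Hypothetical stand-in for a structured fact store. The entry is an
# invented placeholder, not real botanical data.
FACTS = {
    "Examplea minima": {
        "kind": "species of grass",
        "described_in": "Example Catalog",
    },
}


def render_virtual_article(title: str, facts: dict) -> str:
    """Build stub prose from the current facts; nothing is stored."""
    sentences = [f"{title} is a {facts['kind']}."]
    if "described_in" in facts:
        sentences.append(f"It is described in {facts['described_in']}.")
    return " ".join(sentences)


def get_article(title: str) -> Optional[str]:
    """Prefer a human-written page; otherwise synthesise one on the fly."""
    if title in HUMAN_ARTICLES:
        return HUMAN_ARTICLES[title]
    facts = FACTS.get(title)
    if facts is None:
        return None  # nothing known about this title at all
    return render_virtual_article(title, facts)


if __name__ == "__main__":
    print(get_article("Examplea minima"))
    # Updating the fact store changes the next rendering automatically,
    # so the prose cannot drift out of sync with the data.
    FACTS["Examplea minima"]["described_in"] = "Revised Example Catalog"
    print(get_article("Examplea minima"))
```

Because the text is rendered at request time, the article count is untouched and a change to the template or the data shows up everywhere at once, which is the property the proposal is after.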

The only thing that pisses me off is that when I contribute I'm expected to write half a site in addition to such an infobox, otherwise a mod comes along and moves it into my sandbox.

That's the main reason I always stopped editing Wikipedia right away after I tried.

Exactly, agreed. This is why fringe projects are created for more opinionated, subjective, or cultural subject matter.

If you love contributing to Wikipedia-style knowledge, @Localwiki is often the anti-Wikipedia: it's about locally relevant knowledge, and colloquial prose is usually OK as long as it's factual and not malicious.


Raising the barrier to entry generally never helps inclusion, nor is it newbie-friendly.

Yes - he should integrate all external data into Wikidata and add articles referencing the Wikidata entries.

And if the external botanic data is updated he should update the Wikidata entries ... and not only the less-structured articles.

So what are the arguments against what he is doing?

Seems fine to me, and sounds like it's likely to be adding value.

The argument against would be that 2.7M articles containing no useful content are not helpful. (Assuming, which I do, that the articles do not contain useful content.)

They contain lots of useful information.

Here is an article from an earlier comment, https://sv.wikipedia.org/wiki/Eutriana_repens

After reading that I know that Bouteloua repens is a form of grass, who first identified it, who gave it its current name, and some other common names for the plant.

I think it is amazing. If someone has already filled in the information, it just seems like a waste of man-hours to retype it into Wikipedia.

Quote: His ability to document relatively obscure facts helps him combat one of the biggest problems he sees in the Wikipedia community. Many entries, he argues, are made by white male "nerds."

Quote: "It saddens me that some don't think of Lsjbot as a worthy author," he said. "I am a person; I am the one who created the bot. Without my work, all these articles would never have existed."

And just how is he, a white male nerd, combatting the problem?

Some of these are borderline useless, but I can definitely see the point in bots pulling structured data into templates IF the data is rich enough to gain from being in human-readable format OR if the topic is noteworthy enough to warrant a stub, that is, if it is likely to become a hand-written article in the future.

I used to like watching the article counts on this page: http://meta.wikimedia.org/wiki/List_of_Wikipedias

But then the Dutch started using bots to inflate their article count... which was ok for a year or two, but then other Wikipedias started doing the same thing. Now I watch the number of edits.

I don't care that bots are used, but article counts are completely useless in comparing Wikipedias, and it didn't use to be that way. Sure, there were better measurements, but article counts were still pretty good.


    Wiki "size" = [Number of Articles] - [Number of Stub Articles]
Is a human-written stub article about minor characters from some fantasy book more valuable than a robot-written stub article about a type of grass? Is either article worth considering when trying to create some sort of metric for "wiki health"?

Is this metric actually useful for anything, other than as a curiosity? If I were multi-lingual and was trying to decide which wiki to use, I'd go with whichever wiki was in my primary language. If I did not find the information I was after, I would check the other wikis for the same article (nicely listed on the left portion of the screen).

I think this is going to be more common in the future. Not only for Wikipedia, but for every page that serves some kind of content.

Perhaps even sites like 9gag or similar could start out with some computer generated memes ;)

Ha! Yeah, I thought about a hack that just gamifies random scraped words with memes, to see what rises to the top and what humans are receptive to.

But still, it's two different horses: Wikipedia is supposed to be a central repository of community-created knowledge. The veracity of the information and the expertise of its authors are what secured it as a credible popular source.

Would we say the same thing if this was a central repository of just stubs that were computer-generated from the get-go?

I remember on the original mother of all wikis there was a vigorous discussion about walled gardens and a link on the front page warning about such: http://c2.com/cgi/wiki?WalledGardens

If all those botanical pages aren't linked by any other Wikipedia pages it seems like a single link to a "Botanipedia" would be just as well.

Is the source of his bot available?

I know there are tons of wiki-bots, but I'm very interested in the first half of his code, the scraping piece.

This is awesome.

I wonder if he would consider adding some element of readability algorithms to it.

Why doesn't this guy just package up the database and autogenerate the page if the page is blank? That would have more utility than mixing human- and bot-generated articles.

Unless there's some nuance that I'm missing, it sounds like that is exactly what he's doing.

I don't think that's what grizzles is saying. I think the idea was to patch the wiki software to generate these article stubs on the fly from the actual source instead of batch-importing them once.

Sounds like a good idea to me, but I don't know the Wikipedia culture so they might have reasons against that.

Why is he doing this in Swedish and two versions of Filipino? He apparently speaks English, so I assume he could adapt his code to do English entries as well.

I guess he only has a finite amount of time and energy. Converting the bot might be the easiest part, but then, you also need to convince the English 'pedians.

Also, I would imagine many of the articles in English have already been seeded, as Swedish and Tagalog are in general more esoteric.

The English Wikipedia is less desperate for article counts because it already has a huge number, whereas the other language Wikipedias are always looking for more content in their specific language. So, in essence, you could say that he'd run into more issues doing this on the English Wikipedia.

English WP has more people available to give him a hard time for what he's doing.

Bots sometimes have trouble on EN.Wikipedia - see for example the megabytes of discussion generated by the fair-use image bot which was following existing WP rules and helping WP follow the law.

This Swedish guy needs to cut it out. Wikipedia doesn't need more articles that are just "BORK BORK BORK"!
