If nobody else was going to create an article about some species of butterfly, I don't see why adding that information would be harmful to Wikipedia. Does it make Wikipedia harder to read? Harder to search?
I don't think "it's not written by a human" is a valid argument for factual information, and I've never seen any evidence to suggest that it should be one.
EDIT: I found this bot's edit log! https://sv.wikipedia.org/w/index.php?title=Special:Logg/Lsjb...
Here are a few articles randomly picked out of the latest 1000:
After looking at these, I'm beginning to see why there is some backlash. There are literally thousands of articles here that read "X is a species of grass. It got its name from Y and is described in Z catalog." The only people who would need this information are botanists, and they already have their own specialized sources. I'm still not against bot-produced content, but I understand why some people oppose initiatives like this.
These prose versions are now going to steadily fall out of sync with the original databases, be much more prominent in Wikipedia and Google, diverge from each other, and be harder to parse or analyze in any complex way. A database is at least relatively comprehensible, but to parse his dumps you have to reverse-engineer the format and hope that no other bots or editors have modified the articles much and that he didn't get clever with his format strings. If at some point one wanted to change something about the presentation, it's no longer a matter of editing one template and having the user-friendly HTML view onto the database update automatically for all viewers. Instead, one has to run a carefully written bot on millions of articles (and since that is beyond semi-automated bots, you need special permission to run it).
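To make the template point concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the record fields, the `STUB_TEMPLATE` string, the `render_stub` and `render_all` helpers, and the second (made-up) species are hypothetical, and none of it is Lsjbot's actual format. It only shows that when prose is rendered from structured records on demand, a presentation change is one template edit rather than a bot run over every stored article.

```python
# Hypothetical structured records standing in for rows of a species database.
# The first uses facts quoted elsewhere in this thread; the second is invented.
RECORDS = [
    {
        "name": "Brachiaria plantaginea",
        "kind": "grass",
        "described_by": "Heinrich Friedrich Link",
        "genus": "Brachiaria",
    },
    {
        "name": "Examplea fictiva",  # made-up placeholder species
        "kind": "grass",
        "described_by": "A. N. Author",
        "genus": "Examplea",
    },
]

# One shared template: editing this string changes every rendered stub.
STUB_TEMPLATE = (
    "{name} is a species of {kind} first described by {described_by}. "
    "It belongs to the genus {genus}."
)


def render_stub(record, template=STUB_TEMPLATE):
    """Render one record as prose using the given template."""
    return template.format(**record)


def render_all(records, template=STUB_TEMPLATE):
    """Render every record with the same template."""
    return [render_stub(record, template) for record in records]


if __name__ == "__main__":
    # Default presentation.
    for line in render_all(RECORDS):
        print(line)
    # A presentation change is a single new template, applied to all records.
    terse = "{name} ({kind}; genus {genus}; described by {described_by})"
    for line in render_all(RECORDS, terse):
        print(line)
```

Once the rendered sentences are saved as independent wiki pages instead, this link to the template is lost, which is the maintenance cost described above.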
It would have been better to work on merging databases or exporting them into a structured site, something like Freebase.
I still think the article is useful as is, with just the map, data sheet, and demographics, and of course many incorporations have additional human-composed information added.
I could imagine some more structured data source, where the main article redirects to a table and scrolls to the correct spot. I would be fine with that, but as far as I know that concept doesn't exist on Wikipedia.
The structured data source now exists, but how to present its data is still being worked out. You can add information to it now, since the goal is to collect a bunch of structured information and then incrementally figure out how to display it, either on its own or integrated into Wikipedia articles (bot additions here are also very welcome): https://www.wikidata.org
I believe there's going to be some offloading of some structured Wikipedia information so it pulls from Wikidata in the future, instead of being maintained "manually" in articles. For example the geotags that are currently buried in Wikipedia articles' markup will probably be centralized to Wikidata soonish and just pulled from there to display. And infoboxes may be auto-populated from the Wikidata information as well. Sending people to auto-generated stub articles when a "real" article is missing is an interesting idea that might happen longer-term.
I've read pages like that before, and it never once occurred to me that they were anything other than the result of sheer human bloody-mindedness. They're not "exciting", but they're very clearly written in an easily parseable way that doesn't scream "machine-generated" to me. If this is indicative, the quality of output of these bots is excellent, and a good use of automation: let the bots fill out the dry factual stuff, and the humans write the less tangible, non-statistical stuff.
That's not a fallacy, that's just good advice. If this Swedish dude wants praise, he should spend his time doing things which are genuinely good, not dubious and possibly net-negative in the long run.
In many cases, it is very rare for facts like those presented in botanical databases to change: it often means a plant has been recategorized, which is a non-trivial thing to do. It is entirely appropriate for this to be handled manually, given how rare it is.
Your arguments about it being better to work on a structured version present a false dichotomy: it isn't Wikipedia OR a structured version, it is Wikipedia AND structured data that need to be improved.
Why is this not appropriate for Wikidata?
The purpose of Wikipedia is not to be a collection of all the factual information it can gather. You or I might wish for it to be one, but that isn't what its creators mean for it to be. Each individual article is expected to meet Wikipedia's guidelines for relevance and to have some minimal level of quality. If it isn't possible to write a good, encyclopedic article on a topic, Wikipedia's stance is generally that it should be deleted (or, if the article is just overly specific and the information is relevant to a broader topic, the article might be incorporated into a section in a more general article).
I think that looking up plant species is exactly the kind of thing people would want to use an encyclopedia for.
If you want these kinds of one-sentence descriptors, they would be better served in a specialist publication.
For example, the 1910 Encyclopaedia Britannica's complete entry for "denim" is "(an abbreviation of the serge de Nimes), the name originally given to a kind of serge. It is now applied to a stout twilled cloth made in various colours, usually of cotton, and used for overalls, &c."
The entry for "Gimli" is "In Scandinavian mythology, the great hall of heaven whither the righteous will go to spend eternity."
It's not hard to find more examples. But I don't think people considered the EB less of an encyclopaedia for its use of one-sentence descriptors.
The heart of the matter is that there's precious little difference between a dictionary and an encyclopedia. Indeed, the EB's full name is "The Encyclopædia Britannica: a Dictionary of Arts, Sciences, Literature and General Information"
To double check that it's not limited to the EB, I looked in Harmsworth's Universal encyclopedia. The entry for "fulcrum" is "(Lat. fulcrum, a prop) Fixed point in the mechanical system of a lever about which the lever can rotate. See Lever." The entry for "gumboil" is "Small abscess on the gum arising in most cases from decay at the root of a tooth."
See http://menvall.wordpress.com/2010/09/14/on-wikipedias-attemp... for an analysis of the distinction between the two, and the conclusion that "everything that is included in a dictionary also can be included in an encyclopedia, whereas all that is included in an encyclopedia either can or can’t be included in a dictionary. This relation is, however, completely misunderstood by some editors of Wikipedia."
(Had you written that it wasn't appropriate for Wikipedia, then that would be a different issue. I speak now only of the broad category of "encyclopedia".)
Triple-checking, the entry for "gumboil" in Wikipedia (at http://en.wikipedia.org/wiki/Gumboil) redirects to "Intraoral dental sinus". The complete entry is two sentences long:
> Intraoral dental sinus (also termed a parulis and commonly, a gumboil) is an oral lesion characterized by a soft erythematous papule (red spot) that develops on the alveolar process in association with a non-vital tooth and accompanying dental abscess. A parulis is made up of inflamed granulation tissue.
By your definition, this "(almost) one-sentence" article should be removed from WP, no?
Secondly, I didn't demand complete excision with the kind of frothy fervor you're implying. I said a better format for these "X is type Y, discovered by Z, listed in Q" entries is collating them all in a list format. WP has list articles like this aplenty - dense, easily digestible information on similar topics, allowing quick and easy comparison and scanning.
As for your link, one of the bold highlights is "explains subjects in greater detail than a dictionary". Another of the three definitions of 'encyclopaedia' your link provides says "with data on and discussion of each subject identified" (my emphasis). So that's two out of three definitions that quite strongly indicate non-brief articles - your linked article is wrong by its own source material, and hasn't made the case that dictionary-like brevity is suitable for an encyclopaedia.
What, are you trying to 'catch me out' here? Do you think that's a good quality article? It's a stub, it's not what WP wants to encourage, and it's more like a dictionary definition than either "explaining a subject in greater detail" or "discussion of the subject". Yes, I think it's a bad article for any encyclopaedia - it's quite brief, and full of technical jargon. If you didn't already know the specific jargon, it's completely useless as a "general course of instruction" (the etymology argument from your link). And if you do know the jargon, you have a pretty good chance of working it out from the name alone; the article merely confirms the topic if you're unsure, but you don't get any more insight into it.
As 'trick questions' go, this one sucked.
I ask that you clarify your reasoning.
You say my linked-to reference "hasn't made the case that dictionary-like brevity is suitable for an encyclopaedia". The link isn't trying to make that delineation between the two. It's arguing (and I agree) that a dictionary is a type of encyclopedia, not that they are two different things. You mentioned some quotes, in bold. The author later comments on those exact same quotes (with bold translated to italics):
> These definitions show that whereas dictionary is defined by words alone: “reference work that lists words, usually in alphabetical order, and gives their meanings and often other information such as pronunciations, etymologies, and variant spellings“, encyclopedia is defined either as synonymous to dictionary: “the term is often interchanged with the word “dictionary,” as in the present work” or by a larger extension than dictionary: “explains subjects in greater detail than a dictionary”. There is thus no conflict between dictionary and encyclopedia. They are either synonymous or only have different extensions (i.e., encyclopedia including dictionary, but covering a larger set of phenomena).
I checked with the OED, at http://www.oed.com/viewdictionaryentry/Entry/52325 . It concurs, since its definition 1b. for dictionary is (italics mine):
> In extended use: a book of information or reference on any subject in which the entries are arranged alphabetically; an alphabetical encyclopedia
Yes, I'm saying that the article for "gumboil" in WP is not a stub, does not need to be longer than it is, and very much like what WP should support. While I agree with you in that the older print definition of the term is easier to understand than what WP has, that's at most one more line, and more likely solved by rewriting.
BTW, I also looked up Gimlé in WP. That's three sentences long, so a full two sentences longer than the 'Gimli' entry in Encyclopaedia Britannica.
Why must everything require more than a few lines to fit into your concept of an encyclopedia? Certainly Gimlé doesn't fit in a dictionary, so where else would it go?
I'm talking about the function of an encyclopaedia - which your own link has sources generally requiring non-brief articles. Articles which discuss and expand on a subject. Even the etymology provided is 'general education', which implies more than mere definition of a word.
Yes, super-short articles like 'gumboil' or 'Gimle' should be rolled into larger, more comprehensive articles. There is plenty you could add to gumboil - an image to show one, demographic preponderances, common treatments, common complicating factors, all of which enhance the user's knowledge of the topic. It certainly should be reduced or modified in terms of jargon. As for Gimlé, there's no reason why it can't be rolled into a more comprehensive article on Asgard, Norse Mythology, or whatever. Check out the 'Elysium' article for ways you can expand it to make it a more useful article in its own right.
Another thing that you're missing is that WP (and myself) both view these things as undesirable, but not so undesirable that they should be destroyed as a matter of course. They're just bad articles - and contrary to what you're saying, they're far from complete.
In the case of the 'grasses' links of the OP, these are absolutely terrible articles (the irony being that they're chaff - an appropriately grassy reference). Yes, it's information, but it's very poorly laid out and hard to access or compare. It's the absolute barest information - and far, far from "general education" substantiveness. Cool, Brachiaria plantaginea is a grass, but let's have a look at the entry:
Brachiaria plantaginea  is a species of grass which was first described by Heinrich Friedrich Link, and got its current name of Albert Spear Hitchcock. Brachiaria plantaginea included in the genus Brachiaria, and the grass family.   No subspecies are listed in the Catalogue of Life. (ta, google translate)
There is barely any information here beyond "It's a grass". What kind of grass? Is it grass like crabgrass? Like asparagus? Like bamboo? What are its characteristics? Where do you find it? Is it peculiar to any animal's diets? How does it propagate? What does it even look like? Does it have defense mechanisms? Does it survive arid climates well? Are humans allergic to it at all? Not to mention that it's self-evident in the name Brachiaria plantaginea that it's in the genus Brachiaria.
It's an awful, very low quality article - regardless of whether or not you think such information belongs in an encyclopaedia, the article quality does not. Do you feel generally educated by that article? Do you feel like the thing that is Brachiaria plantaginea has been sufficiently discussed? Is the article self-contained (ha!) and explained in detail? These three questions are fundamental parts of the definitions of 'encyclopaedia' given by your original link (and which I don't particularly contest - I rather agree with them).
I mentioned dictionaries because I misunderstood you. Thank you for the correction.
"There is plenty you could add to gumboil" is of course true. It's also true for nearly every single deleted item in WP, including those which aren't sufficiently notable. It's also true of nearly every item which is currently rolled into a larger article. (Hence http://en.wikipedia.org/wiki/List_of_recurring_The_Simpsons_... vs. http://simpsons.wikia.com/wiki/Bernice_Hibbert )
This is an eternal debate by WP editors. Well-defined requirements and boundaries are not possible, only rough guidelines for most areas. This is one such area.
I agree that the information about B. plantaginea is weak. This is not atypical of biological entries in WP. Consider http://en.wikipedia.org/wiki/Calabash_tree , which I chose because it's only a few lines longer than the short entry in the 1910 Encyclopedia Britannica. (BTW, the Swedish entry in WP has a picture of the gourd, while the English one does not.)
After 9 years this stub entry still doesn't answer some of your questions, like "Is it peculiar to any animal's diets? How does it propagate? What does it even look like? Does it have defense mechanisms? Does it survive arid climates well? Are humans allergic to it at all?"
Worse is the line "The fruit pulp is used traditionally for respiratory problems." It doesn't say if it's actually effective, and if so, what is the method of treatment. Is it eaten? Smeared on the chest like a mentholated topical cream? Used as a suppository?
Thus your criticisms, while quite valid, should be tempered by context.
As another example, http://en.wikipedia.org/wiki/Hairy_long-nosed_armadillo is also a 4-line stub. If you look in the history you'll see it was once much more informative. This is because it copied text verbatim from http://armadillo-online.org/dasypus.html#pilosus . That source is under the CC by-nc-sa license, while WP does not accept non-commercial-only licenses, so I believe it was rolled back for that reason.
It's relevant that the armadillo page was created by a bot in 2007, in almost exactly the current form. The main issue is likely that WP is a lousy place for species information. Perhaps it's because the primary literature doesn't meet the copyright requirements, and specialists who can create appropriate text are more interested in contributing to specialist compendiums?
For what it's worth, http://www.conabio.gob.mx/malezasdemexico/poaceae/brachiaria... says it's from Florida and Mexico to South America, with secondary distributions in the Old World. Kew gives details in (technical) English at http://www.kew.org/data/grasses-db/www/imp01488.htm .
Neither lists anything about allergies, its defensive mechanisms, etc.
My own belief is that this bot information for B. plantaginea, etc. should be in an infobox of some sort, rather than free text. I feel that if I edit the text to include the information I identified, then a future bot sweep may be unable to handle the changes automatically.
E.g., German Wikipedia doesn't allow stubs anymore, and the admins delete and revert more pages every day than new ones are created. It may be a cultural problem, as such admins identify themselves with 'their articles' and don't allow any changes.
> Muati is an obscure local god in the Sumerian pantheon. He is associated in some texts with the mythical island paradise of Dilmun, and becomes syncretised with Nabu.
That's unlikely to get much longer. For one, the "Dictionary of the Old Testament: Wisdom, Poetry & Writings ..." says "Muati, a god about whom we know very little."
So, that's a nice benefit for botanists (particularly amateur or student botanists), with only a very minor cost imposed on everyone else (namely, the slight namespace pollution, but that's very unlikely to manifest itself). Sounds alright to me.
For people who do serious article writing, I imagine this might be considered a "cheapening" of their work. For instance, I imagine some editors also resent the notion that encyclopedia editing is somehow reducible to plugging facts into the right templates. Of course the bot's authors don't really believe they are creating articles as high-quality as good human-edited entries, but the emotional reaction on the part of other editors is at least something I can comprehend.
I'm not really in tune with Wikipedia, so this is mostly conjecture.
edit: reverted edit, added first sentence of 2nd paragraph.
The same should be able to be said about everything on Wikipedia, since Wikipedia is not supposed to have original research and should have a source for everything.
The "Why?" would then be "Because it is better to search for [obscure butterfly] and find a short list of facts than to search for that butterfly and find nothing at all."
A stub also lowers the barrier of entry for new users wanting to add an obscure butterfly they've just tracked down.
I don't think so. Well-written encyclopedic entries are in far shorter supply than bare lists of facts.
I guess the fundamental difference of opinion is between those who feel Wikipedia is an encyclopedia, and those who feel it's a dumping ground for human knowledge. Note that I'm not taking sides, just trying to explain the root causes for the difference of opinion.
Also, on a purely technical note, I very much doubt that you couldn't find the information in bot-generated articles anywhere else using a search engine. If that were the case, where are the bots getting the data?
Or bots could be taking something in a weird set of scans, OCRing that, and then putting it in a stub. This would be troubling unless there was a human checking the quality of the OCR.
There is plenty of stuff that is public domain and not online in a useful form.
The argument I'd make for "why?" is that Wikipedia is more accessible and more reliably available than most other resources. I mean, if the government of the Philippines had a web-based, up-to-date list of towns with some basic information, it might make sense to offload the effort of maintaining that information to them. As it stands, though, not even the US has such a directory -- so Wikipedia picks up the slack (or at least it does for towns in the US).
Not spam the wiki with names and basically no info.
Except that those articles seem to have good infoboxes. Such structured information is very useful for many purposes. For instance, it is used to build ontologies based on the data in DBpedia/Wikidata. These datasets help in constructing better semantic tools (like translators, etc.). So it's still pretty useful.
^ <![CDATA[Scribn. & Merr.]]>, 1901 In: Bull. Div. Agrostol. U.S.D.A. 24: 26
Maybe he should have done more testing before spamming Wikipedia with mess like that.
It also doesn't always really help in jumpstarting future improvements if the structure doesn't align with the granularity that makes sense for an encyclopedia article. For example, if there is a genus with three species, each of which is very similar and has very little distinctive written about it, the normal organization would be to write one article on the genus, with a short discussion of each species in the main article, not broken out into three separate duplicative and near-empty articles. You'd only break out into separate articles on each species if there's enough to write about them that covering them as a group becomes unwieldy (this varies widely by species). If I were to hazard a prediction, it's that the English Wikipedia will as a result tend towards better organized species coverage than the Swedish Wikipedia, which will never get around to reorganizing these articles.
The value-add is that I know about Wikipedia, but not about whatever more authoritative botanical site you're mentioning.
I assumed Google already recognized these near-duplicates and was penalizing them.
It seems to me that the wikipedia model works because/when corrections are at most as costly as introducing the error.
The benefit of this is that articles would be dynamically synthesised from the latest data when the user requests them, and not actually created and added to the wiki. This would prevent the creation of a potentially infinite number of articles on subjects not significant enough to merit a write-up by a human author, so could be a way to combat the 'bot inflation' of article numbers.
I don't think anyone's started work on it yet, but if someone fancies it...
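For anyone who does fancy it, the on-request synthesis described above might look roughly like this Python sketch. All names here (`HUMAN_ARTICLES`, `STRUCTURED_DATA`, `synthesize_stub`, `get_page`) and the placeholder article text are hypothetical; a real version would read from the wiki and from a structured store such as Wikidata rather than from in-memory dicts.

```python
from typing import Optional

# Hypothetical store of human-written articles (title -> text).
HUMAN_ARTICLES = {
    "Gumboil": "(human-written article text)",
}

# Hypothetical structured data; in practice this would be the external database.
STRUCTURED_DATA = {
    "Brachiaria plantaginea": {
        "kind": "grass",
        "described_by": "Heinrich Friedrich Link",
    },
}


def synthesize_stub(title: str, facts: dict) -> str:
    """Build a stub from the current facts at request time; nothing is saved."""
    return (
        f"{title} is a species of {facts['kind']} "
        f"first described by {facts['described_by']}. "
        "(Auto-generated from structured data.)"
    )


def get_page(title: str) -> Optional[str]:
    """Prefer a human-written article, fall back to a synthesized stub."""
    if title in HUMAN_ARTICLES:
        return HUMAN_ARTICLES[title]
    facts = STRUCTURED_DATA.get(title)
    if facts is not None:
        return synthesize_stub(title, facts)
    return None  # nothing known about this title


if __name__ == "__main__":
    print(get_page("Gumboil"))
    print(get_page("Brachiaria plantaginea"))
    print(get_page("No such title"))
```

Because the stub is never stored, it does not count as an article, and any update to the structured data shows up on the next request without a bot sweep.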
That's the main reason I always stopped editing Wikipedia right away after I tried.
If you love contributing Wikipedia-style knowledge, @Localwiki is often the anti-Wikipedia: it covers locally relevant knowledge, and colloquial prose is usually OK as long as it's factual and not malicious.
Raising the barrier to entry almost never helps inclusion or makes things newbie-friendly.
And if the external botanic data is updated he should update the wiki data ... and not only the less-structured articles.
Seems fine to me, and sounds like it's likely to be adding value.
Here is an article from an earlier comment, https://sv.wikipedia.org/wiki/Eutriana_repens
After reading that, I know that Bouteloua repens is a form of grass, who first identified it, who gave it its current name, and some other common names for the plant.
I think it is amazing.
If someone has already filled in the information, it just seems like a waste of man-hours to retype it into Wikipedia.
Quote: "It saddens me that some don't think of Lsjbot as a worthy author," he said. "I am a person; I am the one who created the bot. Without my work, all these articles would never have existed."
And just how is he, a white male nerd, combatting the problem?
But then the Dutch started using bots to inflate their article count... which was OK for a year or two, but then other Wikipedias started doing the same thing. Now I watch the number of edits.
I don't care that bots are used, but article counts are now completely useless for comparing Wikipedias, and it didn't use to be that way. Sure, there were better measurements, but article counts were still pretty good.
Wiki "size" = [Number of Articles] - [Number of Stub Articles]
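As a quick sketch of that metric in Python (the function name `wiki_size` and the numbers in the usage example are made up for illustration, and how a wiki decides what counts as a stub is left as an assumption):

```python
def wiki_size(num_articles: int, num_stubs: int) -> int:
    """Wiki "size" = number of articles minus number of stub articles."""
    if num_articles < 0 or num_stubs < 0:
        raise ValueError("counts must be non-negative")
    if num_stubs > num_articles:
        raise ValueError("stubs are a subset of articles, so cannot exceed them")
    return num_articles - num_stubs


if __name__ == "__main__":
    # Made-up counts: 1,000 articles of which 400 are stubs.
    print(wiki_size(1000, 400))  # prints 600
```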
Is this metric actually useful for anything, other than as a curiosity? If I were multi-lingual and was trying to decide which wiki to use, I'd go with whichever wiki was in my primary language. If I did not find the information I was after, I would check the other wikis for the same article (nicely listed on the left portion of the screen).
Perhaps even sites like 9gag or similar could start out with some computer generated memes ;)
But still, it's two different horses. Wikipedia is supposed to be a central repository of community-created knowledge. The veracity of the information and the expertise of its authors are what secured it as a credible popular source.
Would we say the same thing if this was a central repository of just Stubs that were computer generated from the get-go?
If all those botanical pages aren't linked by any other Wikipedia pages it seems like a single link to a "Botanipedia" would be just as well.
I know there are tons of wiki-bots, I'm very interested in the 1st half of his code, the scraping piece.
I wonder if he would consider adding some element of readability algorithms to it.
Sounds like a good idea to me, but I don't know the Wikipedia culture so they might have reasons against that.
Bots sometimes have trouble on EN.Wikipedia - see for example the megabytes of discussion generated by the fair-use image bot which was following existing WP rules and helping WP follow the law.