Wikipedia's markup is just terrible for trying to do any sort of scraping or ana...

ryanjshaw · on April 23, 2018

How hard did you try?

These are not hand-written pages, and their output is actually pretty clean compared to the crazy things I've tried to scrape. They have tons of APIs to access the backing data, and that should be your first stop.

Even if you insist on scraping, in your case you're just looking for a <td> whose immediately preceding <td> contains the text "Latest Release", and that's something any XPath-based scraper can give you straight out of the box[1].

A more resilient choice, if you still insist on (or have to use) scraping, is use the underlying template data - a regex is good enough in that case[2]

[1] e.g. http://html-agility-pack.net/ [2] https://en.wikipedia.org/w/index.php?title=Template:Latest_s...

squeaky-clean · on April 23, 2018

> Even if you insist on scraping, in your case you're just looking for a <td> whose immediately preceding <td> contains the text "Latest Release", and that's something any XPath-based scraper can give you straight out of the box

Sure, until it changes. Here it is in Jan 2016 when it was included in the opening paragraphs as the text "The latest version of OS X is <scrape here>".

https://en.wikipedia.org/w/index.php?title=MacOS&oldid=69769...

And here it is in June 2016 where that sentence was changed to "The latest software version is <scrape here>"

https://en.wikipedia.org/w/index.php?title=MacOS&oldid=72528...

Then around September 2016 it was moved into the sidebar using the template you linked. It looks like that template has been a consistent and reliable element for this since 2012, but then why was it only used in the OSX infobar in middle/late 2016? How else would anyone have known to find it?

And this is just OSX, what if OP wanted a web page that tracked the latest stable release for 20 different OS's? It ends up being pretty frequent maintenance for a small project.

laumars · on April 23, 2018

I can't help feeling like there is a lot of tool blaming happening when the wrong tools were used in the first place. Wikipedia is pretty easy to scrape general blocks of text (I'm the author of an IRC bot which did link previewing, inc Wikipedia) but if you need specific, machine readable, passages which aren't going to change sentence structure over the years then you really should be getting that information from a proper API which cateloges that information. Even if it means having to buikd your own backend process which polls the websites for the 20 respective OSs individually so you can compile your own API.

Using an encyclopedia which is constantly being updated and is written to be read by humans as a stable API for machines is just insanity in my honest opinion.

jancsika · on April 24, 2018

> I can't help feeling like there is a lot of tool blaming happening when the wrong tools were used in the first place.

Well, let's be fair: it's a bit surprising that a series of clear, readable key/value pairs in that Wikipedia "MacOS" infobox table can't be delivered by their API as JSON key/value pairs.

Using their API I can generate a JSON that has a big blob sandwiched in there. With the xmlfm format[1] that same blob has some nice-looking "key = value" pairs, too. Funny enough, those pairs for some reason exclude the "latest release" key.

Anyway, is there any case where a <tr> containing two columns in the Wikipedia infobox table doesn't hold a key/value pair? That just seems like such a valuable source of data to make available in simple JSON format.

[1] https://en.wikipedia.org/w/api.php?action=query&prop=revisio...

always_good · on April 23, 2018

Agreed. It's some mix of the XY problem plus the self-entitlement of "if I had the idea, then it should work."

Yet the classic HNer confuses this for inherent weaknesses in the underlying platform that they then need to share lest someone has something good to say about the platform. And they'll often be using words like "terrible", "garbage", and "I hate..."

squeaky-clean · on April 23, 2018

Please, no one said Wikipedia was terrible. You're taking statements out of context. The original comment said:

> Wikipedia's markup is just terrible for trying to do any sort of scraping or analysis.

I'd like to emphasize the "for trying to do any sort of scraping or analysis." Should we instead lie and say it's wonderful for scraping?

It's not an insult, it's the truth. If you want to build an app that automatically parses Wikipedia, it will not be easy.

laumars · on April 24, 2018

But again, that's the wrong tool for the job so of course it's not going to be well suited. When it's that obvious of a wrong tool saying it's terrible is still kind of silly. It's like saying hammers are terrible at screwing things or cars make terrible trampolines.

squeaky-clean · on April 23, 2018

> Even if it means having to build your own backend process which polls the websites for the 20 respective OSs individually so you can compile your own API

One caveat there, a page like that for MacOS doesn't exist. Scraping Wikipedia may be insane, but it's often the best option. You can scrape macrumors or something, but then you're still just parsing a site meant to be read by humans. You also still risk those 20 OS websites changing as much as Wikipedia.

laumars · on April 24, 2018

Indeed but I was thinking of endpoints that have remained relatively static because they have been auto generated or a history of scraping. Some Linux distros have pages like that (even if it's just a mirror list).

But my preferred solution would be using whatever endpoint the respective platform uses for notifying their users of updates.

This strikes me as a solved problem but even if you can't find a ready to use API then I'd probably sign up to a few mailing lists, update my own endpoint manually and offer 3rd party access for a modest subscription.

Either way, scraping an encyclopedia for an English phrase to parse strikes me as the worst of all the possible solutions.

rexaliquid · on April 23, 2018

At the very least, parsing the Release History table seems way better than looking for a particular phrase in the text.

TuringTest · on April 23, 2018

Have you tried using the mediawiki API [1], or any of the alternatives?[2]

I don't know how well they work, but the built-in parser should give you the text without markup. And since they switched to Parsoid [3] to support the Visual Editor, the've polished the wikitext formal specification so all instances of markup have a well-defined structure.

[1] https://www.mediawiki.org/wiki/API:Parsing_wikitext

[2] https://www.mediawiki.org/wiki/Alternative_parsers

[3] https://www.mediawiki.org/wiki/Parsoid

rspeer · on April 23, 2018

I've also had to parse Wikitext. The fact that there are 54 parsers in various states of disrepair listed here (and I have written a 55th) is not because people really like reinventing this wheel; it's because the complete task is absolutely insurmountable, and everyone needs a different piece of it solved.

The moment a template gets involved, the structure of an article is not well-defined. Templates can call MediaWiki built-ins that are implemented in PHP, or extensions that are implemented in Lua. Templates can output more syntax that depends on the surrounding context, kind of like unsafe macros in C. Error-handling is ad-hoc and certain pages depend on the undefined results of error handling. The end result is only defined by the exact pile of code that the site is running.

If you reproduce that exact pile of code... now you can parse Wikitext into HTML that looks like Wikipedia. That's probably not what you needed, and if it was, you could have used a web scraping library.

It's a mess and Visual Editor has not cleaned it up. The problem is that the syntax of Wikitext wasn't designed; like everything else surrounding Wikipedia, it happened by vague consensus.

Vinnl · on April 23, 2018

I hit the same thing recently, but that's basically what Wikidata was founded for - and I'm sure it has the latest version of macOS. It's really easy to fetch Wikidata data using the Wikidata API (my example: https://gitlab.com/Flockademic/whereisscihub/blob/master/ind... )

mmarx · on April 23, 2018

If you're just interested in a single value, using the SPARQL endpoint[0] is probably still more simple, since you don't have to filter out deprecated statements, for example.

[0] http://tinyurl.com/ya957wem

IAmEveryone · on April 23, 2018

I have a lot of goodwill toward wikimedia, but trying to use wiki data made me question my life choices. It doesn’t help that the official API endpoint Times out for anything mildly complicated. (As in a simple aggravation or sorting in the query)

saagarjha · on April 23, 2018

Unfortunately that data isn't granular enough for what I need: I'm looking for the build number, which Wikipedia somehow keeps up to date.

Someone · on April 23, 2018

https://en.wikipedia.org/wiki/MacOS has a footnote after the build number. That led me to https://developer.apple.com/news/releases/. I guess that doesn’t do semantic markup, and I didn’t look at the html at all, but it looks like it could provide you what you want fairly easily (likely not 100% reliably if automated, but chances are Wikipedia‘s somehow keeps up to date involves humans, too)

Alternatively, buy a Mac, set it to auto-update, have cron or launchd reboot it reboot it twice a day, and read the version info from the CLI after rebooting (https://coderwall.com/p/4yz8dq/determine-os-x-version-from-t...)

maxerickson · on April 23, 2018

The grand vision seems to be that you would retrieve it from Wikidata (https://www.wikidata.org/wiki/Q14116 ).

Of course that is out of date and not in sync with the Wikipedia article. But there's public query services you can use to fetch stuff from there, you wouldn't need to parse html.

StrangeSound · on April 23, 2018

Can't you use their API? https://en.wikipedia.org/api/rest_v1/#!/Page_content/get_pag...

squeaky-clean · on April 23, 2018

I haven't used their API in a long time, but I don't think there is a reliable way to get the sidebar that isn't worse than HTML.

For example https://en.wikipedia.org/w/api.php?action=query&prop=revisio...

returns

    {{Infobox website | name = Hacker News | logo = File:hackernews_logo.png | logo_size = 100px | type = [[News aggregator]] | url = {{url|https://news.ycombinator.com/}} | screenshot = File:hn_screenshot.png | registration = Optional | programming_language = [[Arc (programming language)|Arc]] | founder = [[Paul Graham (computer programmer)|Paul Graham]] | launch date = {{start date and age|2007|02|19}} | current status = Online | owner = [[Y Combinator (company)|Y Combinator]] | language = [[English language|English]] }} '''Hacker News''' is a [[social news]] website focusing on [[Computer Science|computer science]] and [[Startup company|entrepreneurship]]. It is run by [[Paul Graham (computer programmer)|Paul Graham]]'s investment fund and startup incubator, [[Y Combinator (company)|Y Combinator]]. In general, content that can be submitted is defined as "anything that gratifies one's intellectual curiosity".<ref>{{cite news | first = Paul | last = Graham | title = Hacker News Guidelines | url = http://ycombinator.com/newsguidelines.html | accessdate = 2009-04-29 }}</ref>

Which isn't very easy to parse either. From a cursory search, a better format doesn't seem possible without using a 3rd party like dbpedia.

https://www.quora.com/Does-the-Wikipedia-API-give-structured...

I tried it for OSX, and it actually just returns a redirect statement to MacOS. So expect your tool consuming the API to break if you don't handle that in advance.

https://en.wikipedia.org/w/api.php?action=query&prop=revisio...

And then when trying it for MacOS, I can't actually find the version info anywhere in the response data. So you couldn't even get that data without scraping the page.

https://en.wikipedia.org/w/api.php?action=query&prop=revisio...

tim333 · on April 24, 2018

I think some of your issues are just inherent to the fact it's a wiki rather than the design of the markup. I mean I could edit the page just now from "Latest release" to "Latest version" or some such - it's just how wikis are.

shakna · on April 23, 2018

Oh, I know. The markup is incomprehensible, but not to a render engine. It doesn't even seem to impact loading speed. It generates amazing machine-text.

As for scraping... Parsing the hell that is wikitext is all you can do. Or apparently, pipe it through a text browser.

saagarjha · on April 23, 2018

> pipe it through a text browser

That's an interesting idea, and one that I hadn't thought of, but I'd place it closer to matching HTML with regex than actual "parsing". I might use it if I'm really desperate though.