Wikidata or Scraping Wikipedia (simia.net)
178 points by Lockal 61 days ago | 82 comments

This is a great post, which also happens to serve as a good illustration of the "curse of knowledge" and the typical blind-spots of enthusiasts. Consider the timeline of events:

• The blog post on scraping Wikipedia ( https://billpg.com/data-mining-wikipedia/ ) and its HN discussion 4 days ago ( https://news.ycombinator.com/item?id=28234122 ), which mentions Wikidata as an alternative, etc.

• The author of this post, a Wikidata person, finds this an "extremely surprising discussion", and posts a Twitter thread ( https://web.archive.org/web/20210820105621/https://twitter.c... ) ending with

> I don't want to argue or disagree, I am just completely surprised by that statement. Are the docs so bad? Is the API design of Wikidata so weird or undiscoverable? There are plenty of libraries for getting Wikidata data, are they all so hard to use? I am really curious.

This curiosity is a great attitude! (But…)

• After seeing the HN discussion and responses on Twitter/Facebook, he writes this post linked here. In this post, he does mention what he learned from potential users:

> And there were some very interesting stories about the pain of using Wikidata, and I very much expect us to learn from them and hopefully make things easier. The number of API queries one has to make in order to get data […], the learning curve about SPARQL and RDF (although, you can ignore both, unless you want to use them explicitly - you can just use JSON and the Wikidata API), the opaqueness of the identifiers (wdt:P25 wd:Q9682 instead of “mother” and “Queen Elizabeth II”) were just a few. The documentation seems hard to find, there seem to be a lack of libraries and APIs that are easy to use. And yet, comments like "if you've actually tried getting data from wikidata/wikipedia you very quickly learn the HTML is much easier to parse than the results wikidata gives you" surprised me a lot. […] I am not here to fight. I am here to listen and to learn, in order to help figuring out what needs to be made better.

Again, very commendable! Almost an opening to really understanding the perspective of casual potential users. But then: the entire rest of the post does not really address "the other side", and instead completely focuses on the kinds of things Wikidata enthusiasts care about: comparing Wikipedia and Wikidata quality in this example, etc.

I mean, sure, this query he presents is short:

    select * { wd:Q9682 (wdt:P25|wdt:P22)* ?p . ?p wdt:P25|wdt:P22 ?q }

but when he says:

> I would claim that I invested far less work than Bill in creating my graph data. No data cleansing, no scraping, no crawling, no entity reconciliation, no manual checking.

he's ignoring the work he invested in learning that query language (and where to query it), for instance. And this post would have been a perfect opportunity to teach readers about how to go from the question "all ancestors of Queen Elizabeth" to that query (and in trying to teach it, he may have better discovered exactly what is hard about it), but he just squanders the opportunity (just as when he says "plenty of libraries" without inviting exploration by linking to the easiest one): this is a typical thing enthusiasts do, which is unfortunate IMO.

When scraping HTML from Wikipedia, one is using general-purpose well-known tools. You'll get slightly better at whatever general-purpose programming language and libraries you were using, learn something that may be useful the next time you need to scrape something else. And most importantly, you know that you'll finish, you can see a path to success. When exploring something "alternative" like Wikidata, you aren't sure if it will work, so the alternative path needs to work harder to convince potential users of success.


Personal story: I actually know about the existence of Wikidata. Yet the one time I tried to use it, I couldn't figure out how. This is what I was trying to do: plot a graph of the average age of Turing Award winners by year. (Reproduce the first figure from here: http://hagiograffiti.blogspot.com/2009/01/when-will-singular... just for fun) One would think this is a perfect use-case for Wikidata: presumably it has a way of going from Turing Award → list of winners → each winner's date of birth. But I was stymied at the very first step: despite knowing of the existence of Wikidata, and being able to go from the Wikipedia page that lists all recipients (current version: https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi... ) to the Wikidata item for "Turing Award" (look for "Wikidata item" in the sidebar on the left) https://www.wikidata.org/wiki/Q185667 I could not quickly find a way of getting a list of recipients from there. Tantalizingly, the data maybe does exist e.g. if I go to one of the recipients like Leslie Valiant https://www.wikidata.org/wiki/Q93154 I see a "statement" award received → Turing Award with "property" point in time → 2010. Even after coming so close, and being interested in using Wikidata, it was not easy enough for me to get to the next step (which I still imagine is possible, maybe with tens of minutes of effort), until I just decided "screw this, I'll just scrape the Wikipedia page" (I scraped the wikisource rather than html). And if one is going to have to scrape anyway, then might as well do the rest too (dates of birth) with scraping.

Thank you. I am the author of the post, and appreciate your comments, and I agree with them.

I have to say that it indeed wasn't my intention to show how to get to the query - that is a form of tutorial that would be great to write too, agreed, and maybe I should have. What I wanted to write is just comparing the results of the two approaches.

Having said that, yes, again, I agree, a tutorial on describing how to get that data would be great too, and maybe I should write it, maybe someone else should. I agree that it is not trivial at all how to get to the query (and that is a particularly tricky query, certainly not what I would begin with).

Thank you again for your comment, it made me think and mull over the whole thing more. I will talk tomorrow with the lead of the Wikidata team, and I will bring these (and many other points that were mentioned in the last few days) with me. It will take a while, but I hope we can improve the situation.

There's a trick companies like Facebook use to try and protect users from copy pasting malicious scripts in devtools: when they detect it opening (probably keyboard event), they print a big scary warning using console.log/error [1]

Assuming the first thing most scrapers do is open the site in devtools, this would be a great place to print some text with a page-specific Wikidata query that will pull in the exact same information as the current page, along with a link to a really good hacker-style tutorial plus an appendix of how-to guides. Even better would be an option to turn on some sort of dev mode with mouseover tooltips that show queries for every bit of info on the page. Anything that breaks the feedback loop between the code and the browser will decrease the probability that the scraper will use Wikidata. Think of it as a weird inverse user retention problem.

[1] https://imgur.com/a/0Xn1qIb

Thank you! And hope there was nothing in my comments that came off the wrong way. A few more comments, since you seem so receptive. :-)

• I do understand why you wouldn't want to have bothered to write a tutorial (it's too much work, there are enough tutorials already, etc). But still, it may have helped to link to one or two, just to catch the curious crowd.

• Specifically: later yesterday I looked around, and I found this tutorial the most inviting (big font, short pages, enough pictures and examples, and interactive querying right on the page): https://wdqs-tutorial.toolforge.org/ . However, I couldn't find this tutorial linked from Wikidata or the Wikipedia page on Wikidata; I actually found it in the "See also" section of the Wikipedia page on SPARQL. (After reading this one, the tutorial at https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial also looks ok to me, but that's the "curse of knowledge" already: I know I wasn't enthused the first time I saw it…)

• In fact, after taking a few (tens of?) minutes to skim through these tutorials, the query here isn't a particularly tricky query, I thought! So it may not be that the query language is "hard" or "difficult"; the challenge is just to get people over that initial bump of unfamiliarity.

• The Wikidata query page (e.g. https://w.wiki/3vrd) already has a prominent big blue button on the left edge, but somehow the first time I loaded the page it still wasn't prominent enough for me to realize to click it. It may be nice if the button were somehow even more prominent, or if loading the page (for shared links) would automatically display the query results (possibly cached). (Or, the big white area where the results appear could say "click to see results here" or something.)

• It may be worth considering making labelled output the default and raw ids something to explicitly ask for, at least in the beginner's version of the query engine.

• In your blog post, even if not writing a tutorial, IMO it would have helped to just explain the query in a line or two, i.e. translate each of the statements into English. (This is less work than teaching someone to arrive at the query themselves.)

• Even if neither writing a tutorial nor explaining the query, IMO it would have helped to just mention something like "Yes, this query is in an unfamiliar language, but it takes only a few minutes to learn: see <here> and <here>" — basically, just acknowledge that there may be some barrier here (however small) for people who don't already know.

• Such things are exactly our blind spots when writing, so it's not easy. The only way I know is to show the writing to some people in the target audience and get feedback. Fortunately, you don't have to ask too many people: these researchers in usability testing say "You Only Need to Test with 5 Users": https://www.nngroup.com/articles/why-you-only-need-to-test-w...

Thanks for your post, ultimately as a result of reading it, and commenting about it and being shown a solution to my problem, in the end now I'm more likely, and better equipped, to try Wikidata in future.

Thank you for the follow-up. I updated my post a little, mostly with a link to this discussion, as it contains an explanation of the query, and it now also links to a tutorial.

I agree with some of your suggestions on making the system easier to use. It's open source, and I hope someone will be motivated enough to give it a try - the development team can only do so many things, unfortunately.

Thanks again for the constructive comments!

About the Turing Award, after some trial and error, I think this is the request: https://w.wiki/3wmY

Disclaimer: I follow https://www.youtube.com/channel/UCp2i8QpLDnWge8wZGKizVVw / https://www.twitch.tv/belett (mostly in French, sometimes in English).

Without these courses, I wouldn't have been able to write this request.

Thank you, that was educational! At the time I'd have been happy with just getting the data out, so to encourage others, here's a simpler version of the query: https://w.wiki/3x8t

Short version:

    SELECT ?awardYearLabel ?winnerLabel ?dateOfBirthLabel WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      ?statement ps:P166 wd:Q185667.
      ?winner p:P166 ?statement.
      ?statement pq:P585 ?awardYear.
      ?winner wdt:P569 ?dateOfBirth.
    }
    ORDER BY (?awardYearLabel)

Annotated version with comments:

    SELECT ?awardYearLabel ?winnerLabel ?dateOfBirthLabel WHERE {
      # Boilerplate: Provides, for every "?foo" variable, a corresponding "?fooLabel"
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      # "Statements" of the form "<subject> <predicate> <object>."
      # also known as "<item> <property> <value>."
      # Variable names start with "?" and we can think of them as placeholders.
      # For example, a straightforward query that lists winners
      # ("P166" means <award received> and "Q185667" means <Turing Award>):
      # ?winner wdt:P166 wd:Q185667.   # <?winner> <received award> <Turing Award>
      # "Qualifiers" on statements: See 
      #    https://wdqs-tutorial.toolforge.org/index.php/simple-queries/qualifiers/statements-with-qualifiers/
      #    or https://en.wikibooks.org/wiki/SPARQL/WIKIDATA_Qualifiers,_References_and_Ranks
      # A **statement** of the form "<somebody> <received award> <Turing Award>"
      ?statement ps:P166 wd:Q185667.
      # In that statement, the <somebody> we shall call "?winner".
      ?winner p:P166 ?statement.
      # That statement has <point in time> qualifier of "?awardYear".
      # ("P585" means <point in time>)
      ?statement pq:P585 ?awardYear.
      # The ?winner has a <date of birth> of ?dateOfBirth. 
      # ("P569" means <date of birth>)
      ?winner wdt:P569 ?dateOfBirth.
    }
    ORDER BY ?awardYearLabel

?awardYear and ?dateOfBirth are literals, so you don't need to take *Label of them (that's only useful for Qnnn nodes).

Below I use a blank node (since you don't need the URL of ?statement) to simplify the query, and calculate the age as a difference of the two years:

    SELECT ?awardYear ?age ?winnerLabel WHERE {
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
      ?winner p:P166 [ # award won
          ps:P166 wd:Q185667; # Turing award
          pq:P585 ?awardDate]; # point in time
        wdt:P569 ?birthDate.
      bind(year(?awardDate) as ?awardYear)
      bind(?awardYear-year(?birthDate) as ?age)
    }
    ORDER BY ?age
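
For anyone who wants to go from that query to the original goal (average age of winners per award year), here is a minimal Python sketch. It assumes the public endpoint at https://query.wikidata.org/sparql returns the standard SPARQL JSON results format. The function names and the User-Agent string are made up for illustration, and "[AUTO_LANGUAGE]" is replaced with plain "en" on the assumption that the placeholder is only filled in by the web editor.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

# Assumed public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"

# Same shape as the blank-node query above, ordered by year instead of age.
QUERY = """
SELECT ?awardYear ?age ?winnerLabel WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
  ?winner p:P166 [ ps:P166 wd:Q185667; pq:P585 ?awardDate ];
          wdt:P569 ?birthDate.
  BIND(YEAR(?awardDate) AS ?awardYear)
  BIND(?awardYear - YEAR(?birthDate) AS ?age)
}
ORDER BY ?awardYear
"""


def fetch_results(query, endpoint=WDQS_ENDPOINT):
    """Run the query over HTTP and return the parsed JSON (needs network)."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    # Hypothetical User-Agent; use one that identifies your own script.
    request = urllib.request.Request(
        url, headers={"User-Agent": "turing-age-sketch/0.1 (example)"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def average_age_by_year(results):
    """Group the ?age values by ?awardYear and average them."""
    ages = defaultdict(list)
    for row in results["results"]["bindings"]:
        # Every binding is a dict like {"type": ..., "value": "<string>"}.
        year = int(row["awardYear"]["value"])
        ages[year].append(int(row["age"]["value"]))
    return {year: sum(values) / len(values) for year, values in sorted(ages.items())}
```

With network access, `average_age_by_year(fetch_results(QUERY))` should give a dict mapping each award year to the mean age of that year's winners, ready to plot.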

And today I learned Donald Knuth was the youngest Turing award winner at the age of 36. I'm going to have to go learn SPARQL.

I think this is an interesting case because scraping this is easy (just one page), whereas the Wikidata query requires dealing with qualifiers, which is a bit more complex.

(It requires the birth dates, so it is more than one page)

The HTML structure may change over time: if the request is executed a few times over a long period, the scraper may/will require more maintenance than the SPARQL request.

For example, the same wikipedia page 3 years ago is slightly different: https://en.wikipedia.org/w/index.php?title=Turing_Award&oldi...

"The HTML structure may change over time..."

A very common argument in HN comments that discuss the merits of so-called web APIs.

Fair balance:

Web APIs can change (e.g., v1 -> v2), they can be discontinued, their terms of use can change, quotas can be enforced, etc.

A public web page does not suffer from those drawbacks. Changes that require me to rewrite scripts are generally infrequent. What happens more often is websites that provide good data/information sources simply go offline.

There is nothing wrong with web APIs per se, I welcome them (I use the same custom HTTP generator and TCP/TLS clients for both), but the way "APIs" are presented, as some sort of "special privilege", requiring "sign up", an email address and often more personal information, maybe even payment, is for the user, cf. developer, inferior to a public webpage, IMHO. As a user, not a developer, HTTP pipelining works for me better than many web APIs. I can get large quantities of data/information in one or a small number of TCP connections (I never have to use proxies nor do I ever get banned); it requires no disclosure of personal details and is not subject to arbitrary limits.

What's interesting about this Wikidata/Wikipedia case is that the term chosen was "user" not "developer". It appears we cannot assume that the only persons who will use this "API" are ones who intend to insert the retrieved data/information into some other webpage or "app" that probably contains advertising and/or tracking. It is for everyone, not just "developers".

The semantics of RDF identifiers drift at least as often as HTML format changes.

For example, at one point I was doing a similar thing against DBPedia (a sort-of predecessor to WikiData).

I was doing leaders of countries. But it turns out "leader" used to mean constitutional leadership roles, and at some point someone had decided this included US Supreme Court Chief Justice (as the leader of the judicial branch).

So I had to go and rewrite all my queries to avoid that. But most major countries had similar semantic drift, and it turned out easier to parse Wikipedia itself.

DBPedia extracts data from wikipedia (infoboxes, tables) and other sources (wikidata). The circle is complete



I also had a horrible experience using the recommended SPARQL interface to query Wikidata. The queries were inscrutable, the documentation was poor and even after writing the correct queries, they timed out after scanning a tiny fraction of the data I needed, making the query engine useless to me.

However, I had great success querying Wikidata via the "plain old" MediaWiki Query API: https://www.mediawiki.org/wiki/API:Query. That API was a joy to work with.
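
I can't say which calls were used here, but as one small example of what the Query API offers in this context: it can map a Wikipedia page title to its Wikidata item ID via `prop=pageprops`. A minimal Python sketch, where the function names are invented for illustration and the response shape is as I understand it:

```python
import urllib.parse

# Assumed endpoint of the English Wikipedia's MediaWiki API.
WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"


def wikidata_id_url(title):
    """Build a Query API URL asking for the page's 'wikibase_item' page prop."""
    params = {
        "action": "query",
        "prop": "pageprops",
        "ppprop": "wikibase_item",
        "titles": title,
        "format": "json",
    }
    return WIKIPEDIA_API + "?" + urllib.parse.urlencode(params)


def extract_wikidata_ids(response):
    """Map page title -> Wikidata item ID (None when the page has no item)."""
    ids = {}
    for page in response.get("query", {}).get("pages", {}).values():
        ids[page.get("title")] = page.get("pageprops", {}).get("wikibase_item")
    return ids
```

Fetching `wikidata_id_url("Turing Award")` and passing the parsed JSON to `extract_wikidata_ids` is the programmatic equivalent of clicking "Wikidata item" in the sidebar.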

Wikidata (as a backing store for Wikipedia and a knowledge graph engine) is a very powerful concept. It's a key platform technology for Wikipedia and hopefully they'll prioritize its usability going forward.

The WD SPARQL editor has auto-complete (eg type "wdt:award" and press control-space) and readout on hover.

To make the query more readable, use some comments (see my query above).

Yes, WD SPARQL has a firm timeout of 1 minute, after which it may even cut the response off halfway. I think it's falling victim to its own popularity (the API is imho much less popular).

There are optimization techniques that one can use, but they take some experience and patience. One good way is to use federated SPARQL insert to a local repo (assuming you want to selectively copy and reshape RDF data), eg our GraphDB repo has batching of federated queries that avoids the timeout.
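
To be clear, the following is not the GraphDB batching mentioned above, just one simple workaround in the same spirit: if you already have a list of item IDs, you can split it into small batches and send one short query per batch using a VALUES clause, so that no single query runs long. A Python sketch, where the helper names and the batch size of 50 are arbitrary assumptions:

```python
import re

# Wikidata item IDs look like "Q" followed by a positive integer.
QID_PATTERN = re.compile(r"Q[1-9][0-9]*")


def chunked(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def birth_date_query(qids):
    """Build one query that asks for the date of birth (P569) of the given items."""
    for qid in qids:
        # Refuse anything that is not a plain item ID before pasting it into SPARQL.
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a Wikidata item ID: {qid!r}")
    values = " ".join(f"wd:{qid}" for qid in qids)
    return (
        "SELECT ?item ?birthDate WHERE {\n"
        f"  VALUES ?item {{ {values} }}\n"
        "  ?item wdt:P569 ?birthDate.\n"
        "}"
    )


def batched_queries(qids, batch_size=50):
    """Return one query string per batch of item IDs."""
    return [birth_date_query(batch) for batch in chunked(qids, batch_size)]
```

Each returned string can then be sent to the endpoint separately and the results concatenated on the client side.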

> When scraping HTML from Wikipedia, one is using general-purpose well-known tools. You'll get slightly better at whatever general-purpose programming language and libraries you were using, learn something that may be useful the next time you need to scrape something else. And most importantly, you know that you'll finish, you can see a path to success. When exploring something "alternative" like Wikidata, you aren't sure if it will work, so the alternative path needs to work harder to convince potential users of success.

I'm not sure it's that clear. Scrapping is pretty generic, but SPARQL is hardly a proprietary query language - other things use it. If what you're into is obtaining data, sparql might more generically apply than scrapping would. It really depends on what you are doing in the future. At the very least if you do scrapping a lot, you're probably going to reinvent the parsing wheel a lot. To each their own.

> he's ignoring the work he invested in learning that query language (and where to query it), for instance

And Bill is ignoring the work of learning how to program. None of us start from nothing, and it's not like any of this is trivial to learn if you've never touched a computer before.

And to be clear, I'm not objecting - there is nothing wrong with using the skills you currently have to solve the problem you currently have. Whatever gets you the solution. If you're querying wikidata (or similar things) every day, learning sparql is probably a good investment. If you're interested in sparql, then by all means learn it. But if those don't apply, then scrapping makes sense if you already know how to do that.

> [Scraping] is pretty generic, but SPARQL is hardly a proprietary query language - other things use it. If what you're into is obtaining data, sparql might more generically apply than [scraping] would. It really depends on what you are doing in the future.

Yes my point exactly! My point was that even when trying to consider the perspective of people different from us, we can end up writing for (and from the perspective of) people who are "into" the same things as us. Casual users like in the original scraping post are not much really "into" obtaining data, which can be a blind spot for enthusiasts who are. The challenge and opportunity in such cases is really communication with the outside of the field, rather than competition within the field.

"Scrapping" is like nails scrapping on a chalkboard for me.

> he's ignoring the work he invested in learning that query language (and where to query it), for instance

>And Bill is ignoring the work of learning how to program.

I suppose if you didn't know how to program you wouldn't learn Sparql. So the investment in learning how to program has already been made.

There are plenty of Wikidata users who have learned some SPARQL without being programmers.

Why not? People sometimes learn SQL without learning to program, why not sparql?

Because SQL looks and is simpler: plain English words that are easily recognized, with basic queries (select from) that can be taught in less than an hour and then built upon. Now let's look at SPARQL: everything screams technicality. Curly braces (I'm not sure a non-programmer even knows how to type these). Then the variable names prefixed by ?. Then the need to understand what a URI is and how and why prefixes are declared, not to mention the sheer fact of using URIs instead of simple names such as those found for database columns. But even that isn't enough knowledge to start writing the simplest query. One also needs to be taught about RDF triples.

So no, not all query languages are born equal. SPARQL is overly technical and requires a lot of knowledge to do even the simplest things.

I like your reason more than mine.

Non technical persons just learn SPARQL and the principles of Wikidata, and extract data. For them SQL, REST and JSON is much too technical.

Well one reason why someone might learn SQL without learning how to program is that you can get jobs for it.

Ah, but the response might go, lots of people learned SQL when there weren't a lot of jobs for people who knew SQL.

Yes, my response would be, but that was a long time ago and the incentives for people to learn technologies have changed, and I do not think a significant amount of people will learn SQL without learning to program henceforth; at least not amounts significant enough that anyone will say "Well look at that trend!".

Here there can be several responses so I won't go through all the branches, but in the end I don't think there is going to be an interest in learning Sparql among people who are not programmers or at least in programming-adjacent professions, and from what I see there hasn't been that much interest from people who are programmers.

Absolutely spot-on. It makes me think of my own experience.

I've worked for a few niche search engines. Some sites have APIs available so that you don't have to scrape their data. But oftentimes, since we were already used to scraping sites, we wouldn't even notice that an API was available. In a small number of cases, an API _was_ available, but it was more restrictive or complicated than just scraping a page. That's not to say that we never used them, because we certainly did. Just that we were often unaware that they were an option, since they were not very common in our cases.

Not to mention that APIs come with registration, credentials, rate limiting, throttling, etc.

Wikidata's API doesn't require registration, credentials, etc.

I'm one of the comments quoted in that chain of tweets, heh. Here's my specific example. This was years ago, so I don't remember much anymore and things may have changed. But I did now just give it a basic attempt and it still seems Wikipedia is easier than Wikidata. (I did put more effort into using Wikidata when I tried years ago, but all I really remember is it wasn't as fruitful as just fetching wikipedia).

My goal, a list of every airport on wikipedia with an IATA code and the city it is attached to. There is a perfect wikipedia page to start this off on, while as far as I can tell, wikidata does not have any of the data from the table on that page?



I like that geospatial join you have there. Really it should be two query tabs and an interactive map.

I have often wanted a geofilter around my wikipedia search, esp when I am on vacation. Basically, give me every wikipedia page that ever talked about anything within 50km of here. And then one could filter down or have a personal recommendation system boost stuff you like.

I hope this helps with getting started: https://w.wiki/3x3n

And here's a visualization on a map, using geocoordinates: https://w.wiki/3x3g

Thanks, the queries are very powerful, but it still seems like this data is not as usable as the data in the HTML table. Any airports that don't have wikipedia links for the airport or city don't get picked up, and there are disagreeing duplicates in the wikidata that the HTML does not have.

For example (AKG) Anguganak Airport and city Anguganak don't have an article so they don't appear in the wikidata. ALZ doesn't appear in the data because Lazy Bay does not have an article page. There are some duplicate entries, with different cities or airport names like AAL, AAU, ABC. ABQ has 4 different entries. The data also is out-of-date in some instances. "Opa-locka Airport" was renamed to "Miami-Opa Locka Executive Airport" in 2014 for example. In the HTML table all these issues are solved.

Thanks for the answer!

I got the query wrong (reason: https://twitter.com/vrandezo/status/1430206988177219593 )

Here's the corrected query: https://w.wiki/3x8u

This includes a few more thousand results.

AKG does show up (but has indeed no connection to Anguganak), ALZ shows up (again, without a connection to a city). Article pages are not a requirement for the data to be in Wikidata.

I see your point. The duplicate entries can often be explained (e.g. ABQ is indeed the IATA code for both Albuquerque Sunport and Kirtland AF Base, which are adjacent to each other), but that's already a lot of detail.

If a single table provides the form of clean data one is looking for, that's great and should be used (and slightly different than the original question that triggered this, where we had to go through many different pages and fuse data from thousands of pages together). Different tasks benefit from different inputs!

> no entity reconciliation

On the other hand there are still duplicates. I queried Wikidata once and every date result was duplicated because they existed in a slightly different format (7-7-2000 vs 07-07-2000; both were declared as xsd:date). Very "semantic" and powerful data model indeed. In fact the technology should be renamed stringly typed web, because this is what it really is.

That would be a bug and should not be the case. I just tried it and couldn't replicate it. There is no difference between 7-7-2000 and 07-07-2000 in xsd, and neither in the SPARQL query endpoint.

Here are the people in Wikidata born on 07-07-2000: https://w.wiki/3wrj

And here the people born on 7-7-2000: https://w.wiki/3wrk

The results are identical.

(This doesn't mean we have no duplicates at all in Wikidata - the post actually mentions five discovered duplicates within Queen Elizabeth II's ancestors. But these are entities, not within the datatypes)

IMHO WD SPARQL should reject invalid literals: https://phabricator.wikimedia.org/T253718

I missed the original HN and twitter threads referenced in the post, so I might just be repeating something that was already said there...

But, in nearly all cases I would trust a bespoke Wikipedia scraper over using the output of Wikidata or DBpedia. Not to disparage either project, because they're great ideas and good efforts. I have a firm grasp of RDF and SPARQL queries (used to work with them professionally), which also makes them tempting to use.

One issue is that Wikidata tends to only report facts whose subjects or objects themselves have articles (and thus Wikidata entities).

For example, compare the "Track listing" section of Carly Rae Jepsen's Curiosity EP on Wikipedia vs. the "track listing" property on Wikidata.

Wikipedia has:

    1. Call Me Maybe (link)
    2. Curiosity (link)
    3. Picture
    4. Talk to Me
    5. Just a Step Away
    6. Both Sides Now (link)

while Wikidata has:

    1. Call Me Maybe
    2. Curiosity
So not only has it ignored any tracks that aren't deserving of their own articles, but it also missed one that actually does have an article (track 6, a cover of "Both Sides, Now").

> Others asked about the data quality of Wikidata, and complained about the huge amount of bad data, duplicates, and the bad ontology in Wikidata (as if Wikipedia wouldn’t have these problems. I mean how do you figure out what a Wikipedia article is about? How do you get a list of all bridges or events from Wikipedia?)

Often the problem isn't that Wikipedia is wrong, it's that Wikidata's own parser (however it works) doesn't account for the many ways people format things on Wikipedia. With a bespoke parser, you can improve it over time as you encounter edge cases. With Wikidata, you can't really fix anything... the data is already extracted (right or wrong) and all the original context lost.

Scientific Articles are in a similar situation: when importing one from a bibliography database, you won't always find every author... So people made an alternative prop "author name" and some disambiguation tools that allow users to gradually replace those with "author" links to real persons.

Let's put aside the question whether each song ever written should be in WD: I believe all of this data, modeled more elaborately, is available on MusicBrainz. There's a difference between a work (eg "Both Sides Now") and its particular rendition in an album (as you said that track is "a cover"), and MusicBrainz makes that distinction and captures both, but I think WD doesn't (I don't work on music in WD, so I haven't checked).

If you really want all this data in WD then I guess you could import it from MusicBrainz... a massive undertaking.

> Wikidata's own parser (however it works)

There's no such thing (in contrast, DBpedia has the dbpedia extraction framework, which is fairly good but not perfect and suffers greatly from the various ways people use to describe the same thing). WD has tools like QS and wikibase-cli, and people write bots to scrape and contribute specific kinds of data.

BTW the Wikipedia link https://en.wikipedia.org/wiki/Both_Sides,_Now#Other_recordin... is wrong (invalid): you can't have two anchors in a link.


acute observation! there is something here to be teased out .. about.. the final product being a human readable page all these years, and that human readable page got better in ad hoc ways and almost all of those improvements stuck..

compare to the RDF efforts, who ride a rigorous math-y perspective and with a far, far smaller development crowd right away..

> So not only has it ignored any tracks that aren't deserving of their own articles, but it also missed one that actually does have an article (track 6, a cover of "Both Sides, Now").

In other words, "scraping wikipedia" is the answer to the question implied in the HN title to this post. :)

I'd suggest that in this case one should consider using MusicBrainz, in order to get more comprehensive and better results than either with Wikidata or Wikipedia.

I wouldn't say the data is better, just different. Instead of "how do I extract the info I want?" your problem becomes too much data to sift through. See my comment here: https://news.ycombinator.com/item?id=24992600

That obscures my point.

My point is that even if "Wikipedia" may not be the best tool, "Wikidata" is the wrong tool because the data is wrong.

Herein lies the problem for me:

    select * { wd:Q9682 (wdt:P25|wdt:P22)* ?p . ?p wdt:P25|wdt:P22 ?q }

I am extremely motivated to learn how to use this: I have a deep desire to extract data from Wikipedia, and I'm fascinated by graph databases.

And yet, despite trying on several previous occasions, SPARQL has completely failed to stick in my brain.

This is partly my own failing: I'm confident that if I really dedicated myself to it I could get over this hump.

But it's also a sign that the learning curve here really is tremendously steep, which I think indicates a problem with the design of the technology.

I find it helps to translate the syntax into English:

> select *

Show all variables starting with ?

> wd:Q9682

Find item Q9682 (Queen Elizabeth II)

> (wdt:P25|wdt:P22)*

Follow edges that are either P22 (father) or P25 (mother) zero or more times.

Every time you follow one of those edges, add the new item to ?p. Keep following these edges until you can't anymore.

> ?p wdt:P25|wdt:P22 ?q

For every ?p, follow a mother/father edge precisely once, and call the item it points to ?q (if there is no such edge, that ?p is dropped).

The end result is a list of rows containing pairs of (Elizabeth or one of her ancestors, one of that person's direct parents).
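To make the walkthrough above concrete, here is a minimal Python sketch that evaluates the same pattern over a tiny in-memory graph. It is not how the query service works internally; the item IDs other than Q9682 are made-up placeholders, and only the property IDs P22 and P25 come from the query.

```python
from collections import deque

# Toy parent edges: child -> {property: parent}.
# Everything except Q9682, P22 and P25 is a hypothetical placeholder.
PARENTS = {
    "Q9682": {"P22": "Q_father", "P25": "Q_mother"},
    "Q_father": {"P22": "Q_grandfather"},
}

def ancestors_or_self(start):
    """Evaluate `start (wdt:P25|wdt:P22)* ?p`: zero or more parent hops."""
    seen = [start]              # zero hops: the start item itself matches ?p
    queue = deque([start])
    while queue:
        item = queue.popleft()
        for parent in PARENTS.get(item, {}).values():
            if parent not in seen:
                seen.append(parent)
                queue.append(parent)
    return seen

def ancestor_parent_pairs(start):
    """Join every ?p with exactly one further parent hop, bound to ?q."""
    rows = []
    for p in ancestors_or_self(start):
        # A ?p with no known parents contributes no row at all.
        for q in PARENTS.get(p, {}).values():
            rows.append((p, q))
    return rows

rows = ancestor_parent_pairs("Q9682")
```

Note that the first rows pair Q9682 with her own parents, because the `*` path also matches zero hops.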


I feel like one of the reasons SPARQL is confusing is that people bring their intuitions from SQL, which are wrong: the underlying data model is different, but the syntax looks vaguely SQL-like, which leads to misunderstandings.

Where do you end up with the translations from wdt:P25 to "mother"? That's the most incomprehensible part. It feels like I need a reverse dictionary lookup to write a single query.

I 100% agree that namespaces, URLs and numeric Q IDs add significantly to how complex Wikidata SPARQL queries are, and generally make them incomprehensible. The editor at https://query.wikidata.org does have helpful tooltips though.

But honestly I think people would have a much easier time if we had less indirection and just wrote "mother" instead of wdt:P25.

What I actually do is take the ID: if it starts with Q, go to https://wikidata.org/wiki/Q123; if it starts with P, go to https://wikidata.org/wiki/Property:P25

What it actually means in a technical sense:

Identifiers in SPARQL are URLs (sort of similar to integer ID fields in SQL). wdt: is short for http://www.wikidata.org/prop/direct/ so wdt:P25 is http://www.wikidata.org/prop/direct/P25. wdt: basically means normalized, but there are other prefixes if you need to access deprecated statements or modifiers on properties. Gory details at https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Fo...
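As a rough illustration of that expansion, here is a small Python sketch. The prefix table is my understanding of what the Wikidata RDF export uses (note the `www.` host and the trailing slash); the `expand` helper is hypothetical, not part of any Wikidata library.

```python
# Prefix table as I understand the Wikidata RDF export to define it.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",        # items such as Q9682
    "wdt": "http://www.wikidata.org/prop/direct/",  # "direct" property values
}

def expand(prefixed_name):
    """Turn a prefixed name like 'wdt:P25' into the full URI it stands for."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local

mother_uri = expand("wdt:P25")
queen_uri = expand("wd:Q9682")
```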

Btw, if you are using https://query.wikidata.org, you can type wdt:mother, press Ctrl+Space, and it will suggest the P25 property.

Yep. Not only this, but in the sample "wd:Q9682" is a lie. wd is a namespace shortcut which is expanded to a URI, and the prefix-to-URI mapping has to be defined as part of the query, otherwise it won't work. Notice how the sample uses two of those prefixes (wd and wdt): data is segregated into different namespaces that one has to search for and remember each time one wants to make a query. And I mean remembering the prefix value, i.e. a partial URI, not the cute little prefix like wd that semweb samples always use.

The SPARQL endpoint that is typically used with Wikidata has some namespaces implicitly predefined, including wdt and wd.

You do not need to declare the wd: prefix if you are using the endpoint at https://query.wikidata.org

I think it's safe to assume in context that the newbie SPARQL user is not setting up their own SPARQL endpoint but using the official Wikidata endpoint.
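For anyone who does run the query somewhere that lacks the predefined prefixes, the declarations can simply be prepended. A minimal Python sketch, assuming the prefix URIs below are the ones the Wikidata RDF export uses; `with_prefixes` and `request_url` are hypothetical helpers, and the URL is only built, never fetched.

```python
from urllib.parse import urlencode

# Prefix URIs as I understand the Wikidata RDF export to define them.
PREFIXES = {
    "wd": "http://www.wikidata.org/entity/",
    "wdt": "http://www.wikidata.org/prop/direct/",
}

QUERY_BODY = "select * { wd:Q9682 (wdt:P25|wdt:P22)* ?p . ?p wdt:P25|wdt:P22 ?q }"

def with_prefixes(body, prefixes=PREFIXES):
    """Prepend explicit PREFIX declarations so the query is self-contained."""
    header = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in prefixes.items())
    return header + "\n" + body

def request_url(query, endpoint="https://query.wikidata.org/sparql"):
    """Build (but do not send) a GET URL asking the endpoint for JSON results."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

full_query = with_prefixes(QUERY_BODY)
```

On https://query.wikidata.org the bare `QUERY_BODY` should work as-is; the explicit header only matters on endpoints without the built-in prefixes.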

It’s part of the Wikidata data model.

It actually looks pretty similar to regex. But instead of matching strings of chars, we match paths on graphs?

Essentially, given a big graph, SPARQL finds all the subgraphs that match the given constraints and projects the captured variables into a table, one row for every subgraph matched. (Or at least that's how I think about it; not sure if that's officially what it does.)
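That mental model can be sketched in a few lines of Python: a naive matcher that tries every triple against each pattern and collects variable bindings as rows. This is only an illustration of the idea on made-up data, not how a real SPARQL engine is implemented.

```python
# Toy graph as (subject, predicate, object) triples; all names are made up.
TRIPLES = [
    ("alice", "mother", "carol"),
    ("alice", "father", "dave"),
    ("bob", "mother", "carol"),
]

def is_var(term):
    return term.startswith("?")

def match(patterns, triples, binding=None):
    """Return one binding dict per way the patterns fit the graph."""
    binding = binding or {}
    if not patterns:
        return [binding]            # every pattern satisfied: emit one row
    first, rest = patterns[0], patterns[1:]
    rows = []
    for triple in triples:
        candidate = dict(binding)
        ok = True
        for term, value in zip(first, triple):
            if is_var(term):
                # Bind the variable, or check it agrees with an earlier binding.
                if candidate.setdefault(term, value) != value:
                    ok = False
                    break
            elif term != value:     # constants must match exactly
                ok = False
                break
        if ok:
            rows.extend(match(rest, triples, candidate))
    return rows

children_of_carol = match([("?child", "mother", "carol")], TRIPLES)
with_father = match([("?c", "mother", "carol"), ("?c", "father", "?f")], TRIPLES)
```

The second query shows the join: `?c` has to take the same value in both patterns, so only one row survives.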

The property path syntax ( https://www.w3.org/TR/sparql11-query/#propertypaths ) does look a lot like regex syntax, and the general triple pattern construct does kind of feel a bit DFA-ish.

A few days ago, the Wikidata Query Builder[0] was deployed. It provides a visual interface to generate simple SPARQL queries, and you can show the generated queries. Maybe this can help you in understanding how SPARQL patterns work?

[0] https://query.wikidata.org/querybuilder/

That does look like a big step forward.

It could really benefit from some linked examples on that page though - I stared at the interface for quite a while, unable to figure out how to use it for anything - then I dug around for an example link and it started to make sense to me.


Or use something like https://github.com/zverok/wikipedia_ql that uses the MediaWiki API

I think this project is not getting enough attention: https://github.com/zverok/wikipedia_ql

It lets you query Wikipedia (not Wikidata, but the actual human-readable text) more or less directly, mixing the way you describe a scraper with some nicer higher-level constructs.

Can't vouch for its performance, but the API is interesting and nice.

My personal long-standing wish is querying categories, in which the pages have the same infoboxes, by the fields in the boxes. Preferably without waiting to download dozens or hundreds of pages first.

The infobox→Wikidata integration would pretty much solve that (not the other way around), and I'm told that the Wikidata Bridge project aims to do that integration: https://www.mediawiki.org/wiki/Wikidata_Bridge

However, if someone made a different database that would be queryable tomorrow, I wouldn't mind.

Thanks, but this sounds like it only says what subcategories and, perhaps, pages are in the categories—but doesn't contain any data from the pages themselves. My main target is kinda-structured data from infoboxes—e.g. genre, platform, year for videogames. I don't even need categories particularly—I just grab all pages from them, hoping that all the pages I would want are in these categories.

Sounds a lot like DBpedia to me

Hmmmm, indeed. Considering that I've heard of DBpedia before my attempts with Wikidata, I now wonder why I didn't use it. Gonna check what they know about subjects that interest me.

On a related point, while doing some Unicode research, I discovered that the Unicode project itself uses Wikidata as an (untrusted) source for some data (translations of names, if I recall correctly), cf. https://www.unicode.org/review/pri408/pri408-tr51-QID.html although that's not the reference I encountered earlier today. Their system is set up so that if the Unicode organization corrects something previously read, the correction takes precedence over what was pulled from Wikidata, but otherwise the Wikidata value will be used.

Ah, this is what I read earlier today, yay Google+color-changing links http://cldr.unicode.org/implementers-faq

Wikipedia asks people not to crawl it. There are database dumps that you can instead import into your local MySQL and work from there.


Wikipedia has no objection to crawling a couple thousand pages if you do so at a reasonable speed and set a user-agent with a contact email.
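A minimal Python sketch of what "reasonable speed plus a contact user-agent" could look like with only the standard library. The user-agent string, the one-second delay and the helper names are my own placeholders, not official Wikimedia values.

```python
import time
import urllib.request

# Placeholder identity: replace with your own tool name and contact address.
USER_AGENT = "example-research-bot/0.1 (contact: you@example.org)"
DELAY_SECONDS = 1.0  # an assumed "reasonable speed", not an official limit

def build_request(url):
    """Attach the identifying User-Agent header to every request."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

def crawl(urls, fetch=urllib.request.urlopen, sleep=time.sleep):
    """Fetch pages one at a time, pausing between requests."""
    pages = {}
    for url in urls:
        with fetch(build_request(url)) as response:
            pages[url] = response.read()
        sleep(DELAY_SECONDS)    # throttle so the crawl stays gentle
    return pages
```

`fetch` and `sleep` are parameters only so the loop can be exercised without touching the network.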

If you want to crawl millions of pages please use a dump.

I'm all for Wikidata, it's great in some ‘high-profile’ cases like data on countries, at least by my moderate standards. I didn't have much problem with SPARQL, or perhaps my queries were simple. However, once you get into the lowbrow territory of e.g. modern cultural artifacts, people just edit Wikipedia way more, end of story. Want to know what games of some genre were made for platforms of your choice, sorted by year? You go to Wikipedia, not Wikidata.

I'm told that there's a project to integrate infoboxes with Wikidata, so that their info goes into Wikidata when edited (not the other way around)—which would solve a large part of this scarcity, if the integration is seamless enough. Haven't yet seen it in action. Here's the project, Wikidata Bridge: https://www.mediawiki.org/wiki/Wikidata_Bridge

It’s fairly difficult from the other side as well: contributing. I’ve been trying to complete Wikidata from a few open source datasets I am intensely familiar with… and it’s been rather painful. WD is the sole place I have ever interacted with that uses RDF, so I always forget the little syntax I learned last time around. I have some pre-existing queries versioned, because I’ll never be able to write them again. I even went to a local Wikimedia training to get acquainted with some necessary tooling, but I’m still super unproductive compared to e.g. SQL.

It’s sad, really, I’d love to contribute more, but the whole data model is so clunky to work with.

That being said, I now remember I stopped contributing for a slightly different reason. While I tried to fill WD with complete information about a given subject, this was never leveraged by a Wikimedia project. There is a certain resistance to generating Wikipedia articles/infoboxes from Wikidata, so you're fighting on two fronts: you always have to edit things in two places, and it's just a waste of everyone's time.

Unless the attitude becomes "all facts in infoboxes and most tables come from WD", the two "datasets" will continue diverging. That is obviously more easily said than done, because relying on WD makes Wikipedia contribution a lot more difficult... and that pretty much defeats its purpose.

> the two "datasets" will continue diverging.

You may be pleased to learn that there is a project underway that aims to largely solve that problem:


The last piece of news I can immediately find is that it was deployed to the Catalan Wikipedia in August 2020, but I'm not sure what progress there has been since.

I have no problems with the data model, but sadly you can't insert RDF statements: you have to go through tools like QS and wikidata-cli and the WD update performance is dismal.

See https://phabricator.wikimedia.org/T290061, which I posted in https://phabricator.wikimedia.org/project/board/5504/ for DataQualityDays2021

How do other people use Wikidata dumps if they are not using the "official" way of querying it (SPARQL or so)? I have done some pretty raw extraction from it (e.g. download the already pretty large zipped JSON dump, then unzip it on the fly, parse the JSON, and extract triples and entities). Not sure that is really efficient, but the dumps are hard to work with, and I really just needed the entities in one language and the triples/graph of them.
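For reference, the kind of raw extraction described above can be sketched in Python like this. It assumes the JSON dump layout as I understand it (one big array with one entity per line, each line ending in a comma) and the usual `labels` / `claims` / `mainsnak` structure; the function names are hypothetical and the approach is a sketch, not a tuned pipeline.

```python
import gzip
import json

def iter_entities(lines):
    """Yield one entity dict per line of a Wikidata-style JSON array dump."""
    for line in lines:
        line = line.strip().rstrip(",")     # drop the array's separator comma
        if line in ("[", "]", ""):
            continue                        # skip the array brackets
        yield json.loads(line)

def labels_and_edges(entity, lang="en"):
    """Return the label in one language plus (subject, property, object) edges."""
    label = entity.get("labels", {}).get(lang, {}).get("value")
    edges = []
    for prop, statements in entity.get("claims", {}).items():
        for statement in statements:
            value = statement.get("mainsnak", {}).get("datavalue", {}).get("value")
            # Keep only statements whose value is another entity.
            if isinstance(value, dict) and "id" in value:
                edges.append((entity["id"], prop, value["id"]))
    return label, edges

def iter_dump(path):
    """Stream a gzip-compressed dump without unpacking it to disk first."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_entities(handle)
```

Because everything is a generator, memory use stays roughly at one entity at a time, whatever the dump size.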

This tool https://wdumps.toolforge.org/ allows you to create bespoke dumps that are pre-filtered for your needs.

They release a JSON dump but also just a dump of true triples in nt format (basically just tsv). It's a few tens of gigs uncompressed.
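A rough Python sketch of reading such a file line by line. N-Triples terms are whitespace-separated rather than strictly tab-separated, and literals can contain spaces, so this only splits off the first two terms; it assumes single spaces between terms, and a real RDF parser is the safer choice for anything serious. `parse_nt_line` is a hypothetical helper.

```python
def parse_nt_line(line):
    """Split one N-Triples line into (subject, predicate, object) strings."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None                         # blank line or comment
    # Subject and predicate never contain spaces; the object might.
    subject, predicate, rest = line.split(" ", 2)
    obj = rest.rstrip()
    if obj.endswith("."):
        obj = obj[:-1].rstrip()             # drop the statement terminator
    return subject, predicate, obj

edge = parse_nt_line(
    "<http://www.wikidata.org/entity/Q1> "
    "<http://www.wikidata.org/prop/direct/P25> "
    "<http://www.wikidata.org/entity/Q2> ."
)
```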

I'd highly recommend parsing the mobile edition of Wikipedia, it is much easier to parse.

My attention was caught by "Albert The Bear" (second list).

It reminded me of Albert And The Lion: https://www.youtube.com/watch?v=oaw-savyK0s

For me, I ended up using both, opting for Wikidata wherever it made sense, but a lot of things felt half built/broken.

For me, as an occasional Wikipedia editor, Wikidata is just the annoying thing I need to deal with when linking different language wikis together.

I can never figure out anything there, but it’s just the UX I guess

Wikidata has failed. This is not their fault; we have known for 10+ years that machine-readable data fails. See tags, and their failure (hashtag != tag).

Their example of a goat, a simple common animal, does not identify how many teats a goat has (does it?).

Everyone starting out does family trees, because they are easy and easily defined. But even this article's query ends up at fictional characters.

The "period of lactation" for a goat is not a number. And it's not even one graph. It's multiple graphs which we don't have the data to accurately know.

The original article was 100% correct. Web-scraping was the way to get the data. Web-scraping is a very useful and transferable skill. There's no point learning skills on a known failed idea like machine readable data.

Wikipedia allows web scraping, anyone who tells you different is lying, see their robots.txt to make sure you don't get rate limited if doing massive amounts https://en.wikipedia.org/robots.txt (and to see the stuff they don't want you to read). They also have downloadable dumps you can use.
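Checking a URL against robots.txt rules can be done with Python's standard `urllib.robotparser`. The sketch below parses a small made-up rule set instead of the real file, so the rules shown are placeholders and not Wikipedia's actual policy; in practice you would point the parser at the real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only; the real file is much longer.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /w/
Allow: /wiki/
"""

def make_checker(robots_txt):
    """Build a parser from robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

checker = make_checker(SAMPLE_ROBOTS_TXT)
article_ok = checker.can_fetch("example-bot", "https://en.wikipedia.org/wiki/Goat")
script_ok = checker.can_fetch("example-bot", "https://en.wikipedia.org/w/index.php")
```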
