
This may finally be a solution for scraping Wikipedia and turning it into structured data. (Or do we even need structured data in the post-AI age?)

MediaWiki markup (wikitext) is notorious for being hard to parse:

* https://github.com/spencermountain/wtf_wikipedia#ok-first- - why it's hard

* https://techblog.wikimedia.org/2022/04/26/what-it-takes-to-p... - an entire article about parsing page TITLES

* https://osr.cs.fau.de/wp-content/uploads/2017/09/wikitext-pa... - a published paper about a wikitext parser




> Do we even need structured data in the post-AI age?

When we get to the post-AI age, we can worry about that. In the early LLM age, where context windows are fairly limited, structured data can be retrieved selectively, making better use of that limited context.


You might find this meets many needs:

https://query.wikidata.org/querybuilder/

edit: I tried asking ChatGPT to write SPARQL queries, but the Q123 notation used by Wikidata seems to confuse it. I asked for winners of the Man Booker Prize and it gave me code that used the Q ID for the band Slayer instead of the Booker Prize.
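
For anyone who wants to try this outside the query builder, here's a rough Python sketch that looks the prize up by its English label inside the query instead of trusting an LLM to pick the right Q ID. It assumes the item's English label is exactly "Booker Prize" and uses P166 ("award received"); treat both as things to verify against the live data.

    import requests

    # Sketch: fetch Booker Prize winners from the Wikidata SPARQL endpoint.
    # Resolving the prize by its English label sidesteps hard-coding a Q ID.
    # P166 = "award received". Label string is an assumption to double-check.
    QUERY = """
    SELECT ?winner ?winnerLabel WHERE {
      ?prize rdfs:label "Booker Prize"@en .
      ?winner wdt:P166 ?prize .
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
    }
    """

    resp = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "booker-winners-example/0.1 (demo script)"},
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        print(row["winnerLabel"]["value"])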


I use Wikidata a lot for movie stuff. Ideally, I imagine the Wikimedia Foundation itself will look into using LLMs to help parse its own data and convert it into Wikidata content (or confirm it, keep it up to date, etc.)

Wikidata is incredibly useful for data that I consider valuable (e.g. the TMDb link for a movie) but that, due to the curation imposed on Wikipedia itself, typically isn't available for very many pages. An LLM won't help with that, but other bits of information, like where a film is set, would be perfect candidates for an LLM to determine and fill in automatically, with a flag for manual confirmation.


To be fair, I was quite confused by Wikidata's query notation when I tried it as well.


I used that when building a database of Japanese names, but found that even Wikidata is inconsistent in the format/structure of its data, as it's contributed by a variety of automated and human sources!


It's Wikidata, not Wikipedia; they are two disjoint datasets.


Basically every Wikipedia page (across languages) is linked to Wikidata, and some infoboxes are generated directly from Wikidata, so they're separate but overlapping, and increasingly so (a quick way to check the linkage programmatically is sketched after the links below).

https://en.wikipedia.org/wiki/Category:Articles_with_infobox...

edit: a slightly wider-scope category pointing to pages that use Wikidata in different ways:

https://en.wikipedia.org/wiki/Category:Wikipedia_categories_...
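
If you want to follow that page-to-item link programmatically, the MediaWiki API exposes the connected Wikidata item via pageprops. A minimal sketch (the page title and User-Agent string here are just placeholders):

    import requests

    # Sketch: ask the English Wikipedia API which Wikidata item a page
    # is connected to (prop=pageprops, ppprop=wikibase_item).
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "pageprops",
            "ppprop": "wikibase_item",
            "titles": "Booker Prize",  # placeholder page title
            "format": "json",
        },
        headers={"User-Agent": "wikidata-link-example/0.1 (demo script)"},
    )
    resp.raise_for_status()
    page = next(iter(resp.json()["query"]["pages"].values()))
    print(page["pageprops"]["wikibase_item"])  # prints the linked Q ID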


I agree there is strong overlap between entities, and also infobox values, but both Wikidata and Wikipedia have many more disjoint data points: many tables and factual statements in Wikipedia are not in Wikidata, and many statements in Wikidata are not in Wikipedia.


> do we even need structured data in the post-AI age?

Even humans benefit quite a bit from structured data; I don't see why AIs would be any different, even if they take over some of the generation of that structured data.


FWIW, that's been my use case: when I saw the author post his initial examples pulling data from Wikipedia pages, I dropped my cobbled-together scripts and started using the tool via the CLI and jq.


I wonder if Wikimedia is going to offer free AI to everyone, like a free/open version of ChatGPT.

By the way, NASA and NSF put out a request for proposals for an open AI network/protocol.


You might be interested in https://github.com/zverok/wikipedia_ql


What's wild is that the markup for Wikipedia is not that crazy compared to Wiktionary, which has a different format for every single language.


Yeah, I've tried to parse it for Japanese, and even there it's so inconsistent (being human-written) that the effort required is crazy.



