
Transforming Wikipedia into a cultural knowledge quiz - crazygringo
https://medium.com/@mjbaldwin/transforming-wikipedia-into-an-accurate-cultural-knowledge-quiz-b0a0f74877c#hn
======
testplzignore
> Digging in, it turned out “Apple” belonged to the category Steve Jobs which
> eventually belonged to… “People,” of course. It turned out Wikipedia
> categories aren’t strictly hierarchical at all, but are used for so many
> “related” things as to make them useless for determining what kind of a
> thing an article represented.

This feels like a problem worth solving on Wikipedia itself. It would be nice
if categories could be marked as non-hierarchical, so that for a given
category, you could know whether its articles could be classified under all of
the ancestor categories.

[https://en.wikipedia.org/wiki/Category:Eponymous_categories](https://en.wikipedia.org/wiki/Category:Eponymous_categories)
would be a good place to start. Probably most of those are not hierarchical.
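
To make the proposal concrete, here is a minimal sketch of how a "non-hierarchical" flag could change an ancestor walk. The graph is a toy: the "Apple" to "Steve Jobs" to "People" chain comes from the quote (collapsed to two hops), while "Technology companies", "Companies", the `ancestors` function, and the `non_hierarchical` parameter are hypothetical names invented for illustration, not anything Wikipedia provides.

```python
from collections import deque


def ancestors(parents, start, non_hierarchical=frozenset()):
    """Return every category reachable upward from `start`.

    Categories in `non_hierarchical` are still reported as direct
    relations, but the walk does not continue through them.
    """
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent in seen:
                continue
            seen.add(parent)
            # Only keep climbing through categories that are true "is-a" links.
            if parent not in non_hierarchical:
                queue.append(parent)
    return seen


# Toy graph (ASSUMPTION): a simplified stand-in for the real category tree.
toy_parents = {
    "Apple": ["Steve Jobs", "Technology companies"],
    "Steve Jobs": ["People"],
    "Technology companies": ["Companies"],
}

naive = ancestors(toy_parents, "Apple")
flagged = ancestors(toy_parents, "Apple", non_hierarchical={"Steve Jobs"})
```

In this toy graph `naive` contains "People", while `flagged` does not but still reaches "Companies" through the hierarchical branch.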

~~~
tyingq
Wikidata is structured. Here's the entry for Apple:
[https://m.wikidata.org/wiki/Q312](https://m.wikidata.org/wiki/Q312)

~~~
yorwba
Unfortunately, Wikidata has a lot less structured data than the semi-
structured data you can get out of Wikipedia. A database like Wikidata simply
doesn't get the same broad user base as Wikipedia, and therefore also fewer
contributions.

~~~
jpatokal
Wikidata is one of the backends of Wikipedia and much of its information has
been extracted from there.

------
thanatropism
The next step is to make crossword puzzles from this.

(I'm next to illiterate about constraint satisfaction programming. How hard is
it to make reasonable crossword puzzles?)

~~~
theoh
The constraint programming part isn't the bit that makes the difference
between a good and bad crossword. The difference is in the quality of the
clues. Generating plausible 'cryptic' clues is probably well beyond the
ability of current AI.

If you wanted 'Jeopardy'-style clues, that's easier.

------
Udik
Like others, I did the test and was rather disappointed at my score :). And
yes, there seem to be a lot of rappers and Bollywood stars and movies in the
quiz, that don't really appeal to my European sense of "culture". I wonder if
instead of (or better, in addition to) page popularity it wouldn't be wise to
use the _number of translations_ of an entry in other languages. That should
at least ensure that an item is considered important across local cultures,
which is usually a good indicator of cultural importance. Did you try that?
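
One way the suggestion could be combined with page popularity, as a rough sketch: require an entry to exist in a minimum number of language editions, then rank what remains by pageviews. The `Entry` type, the `select_cross_cultural` function, the threshold, and all numbers below are hypothetical placeholders, not real Wikipedia statistics or the quiz's actual selection method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    title: str
    pageviews: int       # views on a single language edition
    language_count: int  # number of language editions that have this article


def select_cross_cultural(entries, min_languages, limit):
    """Keep entries present in at least `min_languages` editions,
    then return the `limit` most-viewed of those."""
    eligible = [e for e in entries if e.language_count >= min_languages]
    eligible.sort(key=lambda e: e.pageviews, reverse=True)
    return eligible[:limit]


# Placeholder values for illustration only (ASSUMPTION).
sample = [
    Entry("item_a", pageviews=900_000, language_count=12),
    Entry("item_b", pageviews=400_000, language_count=150),
    Entry("item_c", pageviews=700_000, language_count=80),
]
picked = [e.title for e in select_cross_cultural(sample, min_languages=50, limit=2)]
```

Here `item_a` is dropped despite having the most pageviews, because it appears in too few language editions.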

------
JauntyHatAngle
Isn't it a bit much to say "Scientifically Accurate" when more or less people
are just checking boxes? My feeling would be that people are massively
overrepresenting their own knowledge.

------
crazygringo
Author/creator here, happy to answer any questions.

~~~
psychometry
Wouldn't it make more sense to ask country first and then generate examples?
I'm from the U.S. and 10% of my examples were Bollywood actors.

~~~
crazygringo
I'd love to do exactly that, but right now I get my popularity stats from
English-language Wikipedia pageviews, which is majority-influenced by the US
but also has significant traffic from India and the UK.

If Wikipedia ever breaks down its pageview stats by country, so I can generate
different quizzes per-country, that's the first thing I'll do!

~~~
Udik
Not to nag you, but have you seen my suggestion of using the number of
translations of each entry as an indicator of their cross-cultural value? What
do you think of it?

------
rcMgD2BwE72F
Why use Wikipedia infoboxes instead of Wikidata items?

~~~
crazygringo
I would have loved to use Wikidata -- I actually first attempted a prototype
of this several years ago using the similar Freebase as a datasource, until it
was bought and shut down by Google.

Wikidata looked very promising, but I was worried about whether it would
contain _all_ the data I would wind up needing, or whether it would be in the
same format 2 or 5 years from now. Wikipedia is a household name and the
information in it has a lot of eyeballs on it constantly, while I couldn't tell
whether I could be equally confident in Wikidata as a project. So really,
taking a conservative approach is the only reason.

~~~
ReverseCold
Wikidata is part of Wikimedia, which is the same org that controls Wikipedia.

~~~
afandian
Wikimedia 'controls' the software and infrastructure, but is relatively hands-
off about the content. There is, shall we say, some give-and-take between
Wikimedia and the various projects' communities.

The communities overlap between Wikipedia and Wikidata to a large extent, but
are distinct.

------
sandov
It's worth pointing out that "Your culture" here means "first world, English
speaking countries' culture".

~~~
telesilla
More than that - pop culture. I hardly know any rappers or RnB singers in the
21st century but I can name, for a start, a number of contemporary
philosophers, computer languages and their creators, artists and composers. I
don't think I'm in the minority for not knowing pop culture? Still, impressive
project from the author and I understand that the result is ultimately pop-
culture skewed, given the restrictions.

~~~
baddox
That’s true, but this is culture as defined by Internet traffic, which is
pretty reasonable. Everyone knows a lot about the aspects of culture they care
about, by definition, so it doesn’t make much sense to have a test for that.

------
dang
Recent discussion:
[https://news.ycombinator.com/item?id=18175910](https://news.ycombinator.com/item?id=18175910)

------
kevinwang
Shouldn't the logistic regression picture look more like this instead of a
line?
[https://en.m.wikipedia.org/wiki/Logistic_regression#/media/F...](https://en.m.wikipedia.org/wiki/Logistic_regression#/media/File%3AExam_pass_logistic_curve.jpeg).
Or is the curve just really flat?

~~~
crazygringo
You're right, I mixed up the terms while writing this. It doesn't use the
logistic function, it's a more general case of binomial regression [1]. (The
example is a line, but the site actually uses a logarithmic function as its
link function.) Just corrected the post, thanks.

[1]
[https://en.wikipedia.org/wiki/Binomial_regression](https://en.wikipedia.org/wiki/Binomial_regression)

------
rahulcap
Very interesting article. In the end, I was a bit confused on how you
converted the binomial regression to a single number. I understood that the
output was a probability that I know each of the 10,000 items, so then did you
need to use some cutoff to decide that I "knew it"?

Anyways, I am interested to see what analysis you do after you get more data.

~~~
crazygringo
Thanks for the interest -- it's actually just a sum of the probabilities for
the items from 1 to 10,000. For example, if there's a 0.1 chance you know each
of 10 items, it adds up to a total value of 1 -- no cutoff needed.

Mathematically, there's a trick where you don't even need to compute the sum
item-by-item... I calculate the binomial regression which gives me the two
relevant parameters, from which I can calculate the probability density
function (PDF) [1] for an item of given rank. Then I just calculate the
associated cumulative distribution function (CDF) with the same two parameters
[2] for rank 10,000 -- and that's the final result.

[1]
[https://en.wikipedia.org/wiki/Probability_density_function](https://en.wikipedia.org/wiki/Probability_density_function)

[2]
[https://en.wikipedia.org/wiki/Cumulative_distribution_functi...](https://en.wikipedia.org/wiki/Cumulative_distribution_function)
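
A minimal sketch of the "sum of probabilities" idea described above. The toy case (ten items at 0.1 each) is from the comment; everything else is an assumption made for illustration: the curve `hypothetical_p`, its parameters `a` and `b`, and the integral shortcut in `closed_form_total` are stand-ins and are not the site's actual model or fitted values.

```python
import math


def expected_known(probabilities):
    """Expected number of known items: the sum of per-item probabilities."""
    return math.fsum(probabilities)


def hypothetical_p(rank, a=0.9, b=0.08):
    # ASSUMPTION: probability falls off linearly in ln(rank), clamped to [0, 1].
    # The real site's curve and parameters are not given in the thread.
    return min(1.0, max(0.0, a - b * math.log(rank)))


def closed_form_total(n, a=0.9, b=0.08):
    """Approximate the item-by-item sum with an integral of a - b*ln(r)
    over [0.5, n + 0.5] (midpoint rule), so no loop over items is needed.

    Only valid while a - b*ln(r) stays inside [0, 1] on that range.
    """
    def antiderivative(r):
        # d/dr [(a + b) * r - b * r * ln(r)] = a - b * ln(r)
        return (a + b) * r - b * r * math.log(r)

    return antiderivative(n + 0.5) - antiderivative(0.5)


toy_total = expected_known([0.1] * 10)  # the example from the comment
item_by_item = expected_known(hypothetical_p(r) for r in range(1, 10_001))
shortcut = closed_form_total(10_000)
```

With these made-up parameters, the loop over 10,000 ranks and the closed-form integral agree closely, which is the spirit of the PDF/CDF shortcut: integrate the curve once instead of adding item by item.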

------
arielbaz
Is combosaurus one of the 10,000?
[https://www.quora.com/Whatever-happened-to-Combosaurus](https://www.quora.com/Whatever-happened-to-Combosaurus)

------
dpatrick86
Following the instructions pedantically (e.g. emphasizing "uniquely identify,"
especially for the individuals) probably leads to an enormous biasing of the
score.

------
personjerry
How did you drive traffic to your quiz site? I feel like this will
significantly affect your results.

------
jordiburgos
How are the items selected?

------
asianthrowaway
I guess I'm not cultured for not knowing about all the rappers and actors who
seem to make up 90% of the list.

I did have a good laugh at Jared Kushner being listed as an "investor".

~~~
jonwachob91
His money is in real estate investments. Might not be tech VC, but an investor
is still an investor.

