
> there’s unfortunately no easy way to sort out the least viewed pages, short of a very slow linear search for the needle in the haystack

So this sentence made me wonder why he didn't actually just go do it; 6M pages isn't really all that big of a data set. Turns out it's a problem of how the data is arranged. The raw files are divided by year/month[0], and then each hour of each day gets its own gzipped file of about 500MB, where pages are listed like this:

  en.m Alcibiades_(character) 1 0
  en.m Alcibiades_DeBlanc 2 0
  en.m Alcibiades_the_Schoolboy 1 0
  en.m Alcide_De_Gasperi 2 0
  en.m Alcide_Herveaux 1 0
  en.m Alcide_Laurin 1 0
  en.m Alcide_de_Gasperi 1 0
  en.m Alcides_Escobar 3 0
  en.m Alcimus_(mythology) 1 0

with en.m (mobile) getting a separate listing from the desktop en, and all the other language and project codes getting their own listings too. So just collating the data would be a huge job.
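
The collation is at least mechanical, though. A rough sketch in Python of what it would look like, assuming you've already pulled a month of the hourly files into a local pageviews/ directory (the paths, the month, and the choice to fold en and en.m together are all illustrative):

  import gzip
  import glob
  from collections import Counter

  # Aggregate per-page view counts across every hourly dump file for one
  # month (assumes the .gz files were downloaded to ./pageviews/).
  totals = Counter()

  for path in glob.glob("pageviews/pageviews-201901*.gz"):
      with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
          for line in f:
              parts = line.split()
              if len(parts) < 3:
                  continue
              project, title, count = parts[0], parts[1], parts[2]
              # Fold mobile (en.m) and desktop (en) rows under one key.
              if project in ("en", "en.m"):
                  totals[title] += int(count)

  # The least-viewed pages are just the tail of the sorted totals.
  for title, count in sorted(totals.items(), key=lambda kv: kv[1])[:100]:
      print(count, title)

A Counter over the English titles should fit in memory; the slow part is decompressing and scanning roughly 744 files for one month. One caveat: a page with zero views in a given hour doesn't appear in that hour's file at all, so truly unviewed pages never show up in the dumps.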

The API also doesn't offer a way to list all pages by view count over a period, so you'd have to make a separate call for each of the 6M pages for a given year and then collate that data too.
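
If I'm reading the REST API docs right, each per-article call looks roughly like the sketch below (the title, date range, and User-Agent string are just examples), which is what makes the one-call-per-page approach so painful:

  import requests
  from urllib.parse import quote

  # One REST call per article: multiply this by ~6M pages.
  API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
         "{project}/{access}/{agent}/{article}/{granularity}/{start}/{end}")

  def yearly_views(title, project="en.wikipedia"):
      url = API.format(project=project, access="all-access", agent="user",
                       article=quote(title, safe=""), granularity="monthly",
                       start="2019010100", end="2019123100")
      resp = requests.get(url, headers={"User-Agent": "pageview-tail-demo"})
      resp.raise_for_status()
      return sum(item["views"] for item in resp.json()["items"])

  print(yearly_views("Alcibiades_DeBlanc"))  # total 2019 views for one page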

[0] For example, https://dumps.wikimedia.org/other/pageviews/2019/2019-01/




MapReduce and dump everything into something like DuckDB.
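
If you go that route, the collation step pretty much collapses into one query. A sketch with DuckDB's Python API, again assuming a local directory of the gzipped hourly files (the paths, table name, and the en/en.m filter are illustrative):

  import duckdb

  # Point read_csv at a month of gzipped hourly files and let a single
  # GROUP BY do the collation across desktop and mobile rows.
  con = duckdb.connect("pageviews.db")

  con.sql("""
      CREATE OR REPLACE TABLE least_viewed AS
      SELECT title, SUM(views) AS total_views
      FROM read_csv('pageviews/pageviews-201901*.gz',
                    delim=' ', header=false, quote='',
                    columns={'project': 'VARCHAR', 'title': 'VARCHAR',
                             'views': 'BIGINT', 'bytes': 'BIGINT'})
      WHERE project IN ('en', 'en.m')
      GROUP BY title
      ORDER BY total_views ASC
  """)

  print(con.sql("SELECT * FROM least_viewed LIMIT 100"))

DuckDB handles the .gz decompression and the glob over files itself, so there's no MapReduce step to write by hand; one GROUP BY over a month of files is the whole job.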



