The impact of Prince’s death on Wikipedia (wikimedia.org)
253 points by The_ed17 on Apr 23, 2016 | 101 comments



For others who were left scratching their heads at what exactly this pop-sci-explained PoolCounter mechanism actually is:

https://wikitech.wikimedia.org/wiki/PoolCounter

TL;DR:

It's a limiter on how many workers start rendering the new page version when the old page version in cache has been invalidated.
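
As a rough illustration of that idea (not PoolCounter's actual protocol or API), a per-key limiter might look like the sketch below. `RenderLimiter`, `do_render`, and the plain dict standing in for the cache are hypothetical names; the real PoolCounter is a separate network daemon shared by many application servers, whereas this only coordinates threads inside one process.

```python
import threading


class RenderLimiter:
    """Allow at most max_workers concurrent renders per key (in-process sketch)."""

    def __init__(self, max_workers=1):
        self._max_workers = max_workers
        self._guard = threading.Lock()
        self._slots = {}  # key -> semaphore limiting renders of that key

    def _slot(self, key):
        with self._guard:
            if key not in self._slots:
                self._slots[key] = threading.BoundedSemaphore(self._max_workers)
            return self._slots[key]

    def render(self, key, cache, do_render, timeout=5.0):
        slot = self._slot(key)
        # Wait for a free slot instead of starting yet another render.
        if not slot.acquire(timeout=timeout):
            raise TimeoutError(f"too many renders queued for {key!r}")
        try:
            # Another worker may have filled the cache while we waited.
            if key in cache:
                return cache[key]
            cache[key] = do_render(key)
            return cache[key]
        finally:
            slot.release()
```

With `max_workers=1`, the first request renders the page and every request that queued behind it reuses the cached result.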


Yeah, I am the engineer mentioned in the article, and I agree the explanation doesn't really work. The pieces of the explanation that ended up in the article itself don't add up to a coherent explanation. The fault for that is mostly mine. In hindsight, my original explanation was too long and too elaborate to be helpful. It's a good reminder that it is easy to go too far with an analogy and end up complicating the thing you were trying to simplify. Oh well, live and learn :)

There are more (coherent and to-the-point) details about PoolCounter in the prologue to PoolCounter.php in MediaWiki's source tree:

https://github.com/wikimedia/mediawiki/blob/1617e7822eaf7426...

And in a short blog post by Domas Mituzas, who is the original author of PoolCounter:

https://dom.as/2009/06/26/embarrassment/


There is a stochastic approach that can be adapted to address this problem. I think I first saw it at IMVU in 2009, but conveniently Wikipedia has a good reference now: https://en.wikipedia.org/wiki/Cache_stampede#Probabilistic_e...

The advantage is less coordination is necessary and you should be able to get down to a single concurrent rerender per page.
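
The probabilistic early expiration idea linked above can be sketched as follows. This is a minimal in-process illustration under stated assumptions, not anyone's production code: the `_cache` dict, `fetch`, `should_recompute`, and the `beta` default are names and values chosen for the example.

```python
import math
import random
import time

# Hypothetical in-process cache: key -> (value, delta, expiry).
_cache = {}


def should_recompute(now, delta, expiry, beta=1.0, rng=random.random):
    """Return True if this request should refresh the entry.

    delta is how long the last recompute took; expiry is the absolute
    expiration time. -log(u) is non-negative, so the check always fires
    once the entry has expired and fires with growing probability as
    expiry approaches.
    """
    u = 1.0 - rng()  # random.random() is in [0, 1); shift to (0, 1] so log is defined
    return now - delta * beta * math.log(u) >= expiry


def fetch(key, ttl, recompute, beta=1.0, clock=time.monotonic):
    entry = _cache.get(key)
    now = clock()
    if entry is None or should_recompute(now, entry[1], entry[2], beta):
        start = clock()
        value = recompute()
        delta = clock() - start  # remember how expensive the rebuild was
        _cache[key] = (value, delta, clock() + ttl)
        return value
    return entry[0]
```

Each request decides independently, so no lock or shared counter is needed; a larger `beta` makes early refreshes more likely.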


Wow, that's a nice technique, but in this case, isn't the rendered page invalidated by changes to the source, rather than over time?

I suppose, in this case you could use the time since invalidation as your input. The downside is that changes in the source aren't immediately reflected in the rendered output, especially for infrequently updated pages.


Wow, fascinating -- thank you for the pointer!


Basically a page view checks the cache and rebuilds if necessary. Thousands of page hits in the same second, before the build is over, start thousands of parallel rebuilds.


Also known as the thundering herd...


Yeah, I never really solved it while I was using memcached, but I'm not Wikipedia.


The way I've seen it done is something like this:

* Cache never automatically expires, but you do have some notion of staleness.
* Whenever you request data, you get it from the cache (if it exists) and check for staleness, so your cached data needs to know when it was cached.
* Return the data as usual, but if the data was also stale, you fire off a worker to update the cached data.
* If you have lots of requests happening at the same time, you have a system for seeing if a worker already exists, to ensure that you only create one (for each piece of cached data).
* For the time it takes for the worker to complete, you have to be okay with serving stale data; in most cases this is okay.

There's an edge case missed here, which is what to do when the cache is empty (either because it's one of the first requests, or because the cached data has been evicted). That's up to you, depending on your use case. You can basically either return a default value, you can pre-warm your cache, or you can let the requests hang until the data is ready.
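
The steps above can be sketched roughly as below. This is a single-process illustration under stated assumptions: `StaleWhileRevalidateCache`, `rebuild`, and `max_age` are hypothetical names, a thread stands in for the worker, and the empty-cache case simply lets the request hang while it rebuilds inline.

```python
import threading
import time


class StaleWhileRevalidateCache:
    def __init__(self, rebuild, max_age):
        self._rebuild = rebuild    # hypothetical callable: key -> fresh value
        self._max_age = max_age    # seconds before an entry counts as stale
        self._data = {}            # key -> (value, cached_at); never expires on its own
        self._refreshing = set()   # keys that already have a worker running
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._data.get(key)
            if entry is not None:
                value, cached_at = entry
                stale = time.monotonic() - cached_at > self._max_age
                if stale and key not in self._refreshing:
                    # Only one worker per key, however many requests arrive.
                    self._refreshing.add(key)
                    threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    ).start()
                return value  # possibly stale, but returned immediately
        # Empty-cache edge case: block and rebuild inline. Concurrent cold
        # misses are not deduplicated in this sketch.
        value = self._rebuild(key)
        with self._lock:
            self._data[key] = (value, time.monotonic())
        return value

    def _refresh(self, key):
        try:
            value = self._rebuild(key)
            with self._lock:
                self._data[key] = (value, time.monotonic())
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

Returning a default value or pre-warming the cache would replace the inline rebuild at the bottom of `get`.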


I should've fixed the formatting whilst it was still editable :(

A former colleague of mine built a Django implementation of this pattern, which is pretty useful: https://github.com/codeinthehole/django-cacheback


I never looked into whether memcached will give you the creation date. I just let memcached expire it itself.


Why isn't it a possibility to rebuild and invalidate the cache only when the rebuild is finished?


Because the rebuild is usually triggered by the cache entry expiring. At that point the entry is no longer available, so every page view attempts a rebuild. Otherwise the cached value and its expiry time would have to be tracked separately. I store the value in memcached with an expiration and detect the expiry that way.


You could queue an async job to rebuild what was stored in the cache right before the expiration time when you update / renew the expiration on the cached item?


+++ath0


+1 for whoever named it the "Michael Jackson problem". I'm going to start using that if I ever encounter the same issue myself.


Is this not the thundering herd problem?


> Is this not the thundering herd problem?

There's a wikipedia article on the thundering herd problem.

https://en.wikipedia.org/wiki/Thundering_herd_problem

The article needs a little love, though.


As I understand it, yes, but I think the "Michael Jackson problem" is a better name.


It's a cute name, but there is existing research on this issue that calls it "the thundering herd problem", so I would recommend the name you can actually search for without problems.


Who?



American Idol caused widespread rearchitecting of SMSGW/SMSC. Fun times; voting was a very distributed DoS.


Interesting how in the graph it looks like some people found out about 25 minutes before it was more publicly found out.


16:49 UTC refers to when TMZ first reported Prince died. Prior to that, there had been plenty of reports that there were ambulances at Paisley Park. It is not strange that people would google either "paisley park" or "prince" and then follow the results to Prince's wiki page.


I don't know the exact timeline, but my understanding is that there were reports of a death in the area where Prince lived before it was known that it was Prince.


I found a few references to stories about a 'medical situation' at Paisley Park that came out before TMZ's report—I'm assuming it was those reports.


I wonder if there's a crossover between celebrity reporters, and people who update Wikipedia on celebrities?


I'm seeing an exponential curve. Isn't that what we should expect?


Look very closely at the section of the graph starting around 4:20 PM. There's a small but significant increase in hits before the big spike starts around 4:50.

It's easier to see on the full resolution graph:

https://upload.wikimedia.org/wikipedia/commons/f/f2/Prince_a...


How are Wikipedia articles kept consistent with each other? Say someone like Prince dies. His page will instantly change, seemingly while his portrait is still in the sky and the cannon fires.

But with certain people there's a variety of connected items that need referential integrity. For instance, I can imagine Prince being on one of those lists (e.g. highest grossing) that uses bold text for still-living artists. Office holders need to be moved from "incumbent" to a box with dates, and the new incumbent needs to be added. And then there are text snippets that are in the present tense ("Prince and David Bowie are among the greatest living artists").

And then there's the corresponding pages in other languages.

How's it done?


> How are Wikipedia articles kept consistent with each other?

They aren't. There is no transactional/referential integrity on Wikipedia. When someone famous dies, a pretty common pattern is that first the death date is added, and then in a later edit someone gets around to changing the present tense verbs into past tense verbs ("Prince is a singer..." --> "Prince was a singer...").

I can tell you all about how category normalization is maintained, though. It starts with this process: https://en.wikipedia.org/wiki/Wikipedia:Categories_for_discu...


Some material like that is generated by templates, so updating the templates updates it in many places.

A lot, however, is just done manually by the armies of volunteers that contribute to the site. More than a few people specialize in updating bits of minutiae like that.


I don't think WMF staff are credited enough for the work they do in keeping Wikipedia running. They seriously know how to scale, I think the only ones better than them are honestly Facebook and Twitter!


I don't know if I would go that far, but WMF also don't have nearly the resources of Facebook or Google or Amazon, and at their scale nothing is easy anymore.


Google as well?


You know, Google are so much in my life I don't even notice them!

Google beats everyone :-) they are so far ahead that I didn't even consider mentioning them!


Can't caching a page with varnish and memcache handle this?


They cache the bejeezus out of their pages. Problems come up when a lot of people want to edit a page, an inherently uncacheable operation.


It's possible to use stacks to 'cache' writes in scenarios like this.

Writes to the same object go in the same stack, iterate over stacks, pop the first item, write it, clear the stack.

It works miracles for ephemeral data like Wikipedia edits.

If you have extremely spiky load on servers, stacks are also a great replacement for queues. Accept that during the deluge some portion of queries will time out and go unanswered: instead of trying to process queries that are likely to time out, simply process the query on top of the stack and don't waste time processing the ones bound to fail.
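
A minimal sketch of the per-object stack idea described above, assuming last-write-wins is acceptable. `CoalescingWriteBuffer` and the `write` callback are hypothetical names; note that intermediate writes are discarded, so this only fits data where losing them is fine.

```python
from collections import defaultdict


class CoalescingWriteBuffer:
    def __init__(self):
        # object id -> pending writes for that object, newest last
        self._stacks = defaultdict(list)

    def push(self, obj_id, payload):
        self._stacks[obj_id].append(payload)

    def flush(self, write):
        """Apply only the newest pending write per object.

        Returns the number of older writes that were dropped.
        """
        dropped = 0
        for obj_id, stack in self._stacks.items():
            write(obj_id, stack.pop())  # top of the stack = most recent write
            dropped += len(stack)       # everything underneath is discarded
            stack.clear()
        self._stacks.clear()
        return dropped
```

For example, pushing two revisions of the same object and then calling `flush` performs a single write with the second revision.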


Doesn't caching writes from multiple different servers potentially cause consistency and durability concerns?

I know MongoDB still haven't marked the bug [1] reported by Kyle Kingsbury [2] that found stale reads on all consistency and write concern levels...

1. https://jira.mongodb.org/plugins/servlet/mobile#issue/SERVER...

2. https://aphyr.com/posts/322-jepsen-mongodb-stale-reads


Yes, but this is Wikipedia; the entire premise of it is eventual consistency. The idea is that the most recent update to a page is the correct one.


Good point.


Varnish does handle the mass of logged-out cached requests. However, because there were so many page edits per second, the cache in Varnish would only be valid for about a second. Then a flood of logged-out users hit the servers at the same time, requesting the uncached page to be rendered. The PoolCounter extension keeps the web servers under control by throttling requests for page rendering.


Why not just set the min TTL to a few minutes, or even a minute, for anonymous users? Is there enough usefulness in ~seconds of delay on article edits vs minutes to warrant a much more complex design?


Correct me if I am wrong, but I thought Varnish has support for limiting concurrent backend fetches to the same resource.


Across a cluster of varnishes? I think that limit is per varnish.


No, definitely not across a cluster (although that would be quite nifty). Even on a single node that would reduce the thundering herd effect substantially.


Maybe that's a new feature request!


So... did you read the article?


This is so impressive, to see behind the curtains of what has become the central repository of humanity's knowledge, during a moment of loss of one of humanity's greats.


Take note, robots at Facebook and Google. Monetizing humanity's information through ads is not the only way.


> the central repository of humanity's knowledge

luckily, knowledge is decentralized


> during a moment of loss of one of humanity's greats.

Compared to, let's say, Bill Gates, who's saved millions of lives?

Even artistically, I'm not sure Prince was up there in the top 1%

The power of marketing.....


Prince has provided millions (probably billions) of fans with moments of joy in their lives. The album "Purple Rain" pretty much was the bookmark of my freshman year at college. So many memories of that first year away from home come rushing back whenever I hear any of the songs from that album. I can vividly recall specific events, settings, and people for almost all of them, over three decades later.

I never saw Prince live, but talking to people who did, he commanded the stage and the audience like few others. Without exception they describe him as one of the best live performers they have ever seen.

I don't see why you bring up Bill Gates. His philanthropy is laudable and that stands on its own. I don't see how recognizing that Prince is probably up there with the best performing artists in human history takes anything away from that.


William Shakespeare didn't save as many lives as any physician of the time. Yet here we are, 400 years after his death, commemorating the man and his work.

Different people contribute differently to the betterment of mankind. That doesn't mean some contributors need to be censored away from popular culture just because you perceive their contributions to be not as important as others'.


I agree with your point, though I think many physicians at the time of Shakespeare may actually have had a negative score when it came to saving lives.

http://historyworld.net/wrldhis/PlainTextHistoriesResponsive...


Just lol. Blind comment. He played and produced his first album himself: all 27 instruments, all production, composition, arrangement. Pretty sure he was under 20 also. Sign o' the Times lyrically hits you like Bob Dylan. I'm not sure what makes an artist to you, but Prince created some of the best art I've heard. But yes, beauty is in the eye of the beholder.


If you're serious, do you genuinely believe you know what you're talking about?


I totally agree with the sentiment of this comment (though I'm not sure how much marketing is involved.) Prince was very good at music and was, well, kind of a dick. I don't quite get the massive outpouring of grief that has ensued.


"Kind of a dick" I think could easily apply to Gates as well, no?


Well, I wasn't really responding to the Bill Gates part in particular. Gates was a bit of a megalomaniac with MS. But he's doing awesome stuff with the money now. As a person, I'm not aware of him ever being a jerk.


So if you gain lots of money via semi nefarious means what fraction do you have to dedicate to good works before the earlier wrong is cancelled?

I know that he didn't gas 6 million people but letting people buy their way out of moral debt with a fraction of the money they gained still seems horribly repugnant.


First, what moral debt? Second, I imagine the sum total of his humanitarian efforts is greater than the total charity if all of those dollars had remained in the pockets of each person who bought Windows 95 et al. So repugnant seems like a real stretch.


They mentioned 5M views within 24 hours of Michael Jackson's death. With over 3B Internet users out there, I am actually a little surprised how small the spike was. Did they only count English Wikipedia? Even so I am quite surprised. I would expect 10-20M at least. Similarly, many young people like myself have never heard of Prince, I had to look him up to find out who he truly was.


They recently overhauled it, [0] but back then the pageviews [1] wouldn't count mobile users. You can see the old stats for Jackson's page here: http://stats.grok.se/en/200906/Michael%20Jackson. The actual number is 5,875,404 views within that day (in whichever time zone) and is for the English version of the article specifically.

[0]: https://blog.wikimedia.org/2015/12/14/pageview-data-easily-a...

[1]: https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics


Thanks. Yeah, I think mobile viewers would be a substantial amount, but the desktop user count is still below my personal expectation. We are talking about 850M English-speaking Internet users :( Only 6M page views within 24 hours is really quite low.


As an "old" person (I'm 32) it blows my mind that younger generations might not be familiar with Prince's work. Unfortunately these "damn, I'm old!" moments keep cropping up more and more just lately. ;)


Just out of interest, did you go directly to Wikipedia to find the information, or did you go to Google, which then led you to Wikipedia?


Always Google, which leads me to Wikipedia. Almost literally every time...


this is what was going on on twitter at this time: http://tweetfortat.net/timeploteventsTWAPPERKEEPERjacko.php5


What happened at 7:15?


End of a news broadcast maybe (end of the 10 o'clock news on the west coast would work I think)


Given how short the spike is, I wonder if it's possibly due to a misconfiguration somewhere and a TZ offset is either getting misapplied or mis-corrected for?


Even at peak it's just ~800 hits per second - it shows how irrelevant the C10k problem is (yes, I know it's not exactly about hits per second, but still).


The C10k problem is absolutely not irrelevant, and certainly not because one page only saw 800rps during a worldwide event. Two things there:

(1) 800rps TO THAT PAGE is the metric. The entire rest of Wikipedia was still getting traffic, and as an educated guess I would estimate raw traffic to be on the order of magnitude of 3-4krps (across editing and views). They are quite open with operations and if I weren't mobile I could probably find the accurate answer.

(2) There are much higher traffic properties. I'm aware of one property beyond 200krps in aggregate.

If you had said "most people don't have to worry about C10k," then I'm on board. That's true. Irrelevant? Far from it.

And yes, query rate and connection count have a complicated relationship. You need three or four other metrics to explain their relationship, but raw query rate is a good yardstick for active connections when combined with quantiled request latency. (Not averaged.) Simple example: a 750ms 95th% page hit 10,000 times per second is almost certainly far > C10k because of the outliers.
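
One rough way to relate request rate to in-flight work is Little's law: average concurrency equals arrival rate times mean time in the system. The sketch below applies it to the 10,000 requests-per-second example. The mean latencies are made-up illustrative values (a 95th percentile alone does not pin down the mean), and idle keep-alive connections, which add to the connection count, are ignored.

```python
def in_flight_requests(requests_per_second: float, mean_latency_seconds: float) -> float:
    """Little's law: average concurrency = arrival rate * mean time in system."""
    return requests_per_second * mean_latency_seconds


RATE = 10_000  # requests per second, taken from the example above

# Hypothetical mean latencies in seconds; none of these are measured values.
for mean_latency in (0.2, 1.0, 1.5):
    print(mean_latency, in_flight_requests(RATE, mean_latency))
```

The point of the exercise is that the slow tail matters: the longer the outliers hold their connections, the higher the mean and the more connections are open at once.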

Now I will grant that C10k itself is somewhat irrelevant, yes, but not for the reason you are saying. It was defined in an age when 10,000 active connections was pretty surprising (didn't it come from FTP or some other heavy eyeball protocol?). These days with long poll apps, long-running protocols, and so on, millions of open connections are quite common at consumer scale. I find C1M far more interesting these days. C10M is still kinda nuts, but does exist in the magical world of metal and fiber and hot aisles and all those great things that nobody uses anymore (depressingly).


https://grafana.wikimedia.org/dashboard/db/varnish-http-erro... (and grafana.wikimedia.org in general) have more stats. It stated "13.38 Million req/min", which I think is ~223,000 req/s


Ah. So my hunch was that the published number was what is making it through cache, and that's where my estimate comes from too. That sounds about what I'd expect for the cached side.

Nice find!


The total number of requests that get through the cache to the application layer can be seen here

https://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=...

which you can see is not showing any substantial increase due to the passing of Prince. The big hole of two days that ends just before the news broke is due to Wikimedia switching traffic to a second datacenter for two days, see http://blog.wikimedia.org/2016/04/18/wikimedia-server-switch...


The 10k-connections case is related to the 10k-hits case, because 10k hits is 10k connections, just short-lived ones. What I tried to say: "if even such a popular Wikipedia page doesn't have a 10k-hits problem, then for 99.99% of projects it will be a corner case, not an everyday routine". Sorry if it was worded badly.


C10k doesn't come up with Wikipedia because it doesn't do websockets. If it did, it would probably have millions of concurrent connections, far over 10k. I dare say 10k is a bit passé. WhatsApp can apparently do 2m/server.


2m per second? These guys must have squeezed every ounce of performance out of BEAM.


2m refers to concurrent connections, not rate.


Ah, that makes sense. Still, is 2m rps within the realm of possibility for BEAM with a high-end server grade processor and heavy use of actors/concurrency?


The Phoenix framework tests in October got up to 2m concurrent active (well, mostly just awake) connections, without timeouts.

http://www.phoenixframework.org/blog/the-road-to-2-million-w...


Finally a replacement for "Site got slashdotted": "Site got Prince'd". I like it.


"He was ... known for, among many other things, ... a performance at Super Bowl XLI in a raining downpour in front of over a hundred million people."

Typo and/or I call bullshit.


This[1] seems to indicate viewership of the halftime show peaked at 140M.

1: https://web.archive.org/web/20090412054158/http://www.suntim...


Okay, I can see where you're coming from but the sentence was "raining downpour in front of over a hundred million people." How does that NOT imply a live audience of a hundred million? I can't believe people are downvoting me for this. The sentence is absurd as written.


Downvotes aren't punishment. Up/down votes indicate "this comment ought to be more/less prominent on the page". The sentence was somewhat ambiguous and your misunderstanding is understandable. But your comment is not useful to the discussion because most people understood and in any case it's tangential.


Okay honestly am I the only one who reads "in front of a hundred million people" and thinks it actually means he was in front of a hundred million people? I don't see it as an ambiguity, it's simply a wrong statement. I'm okay with people being uninterested in (and even downvoting) a correction, but I'm baffled that anyone could read the sentence as anything other than incorrect. I mean, they even describe the weather of the event as if to make it sound even more impressive that all these people showed up!


I think they mentioned the weather because it's relevant to his reputation as a legendary live performer and it's part of a great story. [1] There's something almost magical about incredible live performances. I can't really describe it, but sometimes a show has the perfect mix of emotion and artistry and it's completely mesmerizing to be a part of, even as a spectator. Participating in that moment bonds millions of people together in a small way. Even if they have nothing else in common, they were able to experience that together. Prince's Superbowl performance was one of those moments.

[1] http://www.maxim.com/entertainment/prince-2007-super-bowl-pe...


It's a poorly written sentence, but it's also fairly clear what they meant to say. It's polite to give people the benefit of the doubt.


I've reworded the sentence in the post—does it work better that way? Many thanks for the feedback, everyone. :-)


Thanks, I really do appreciate it. The thought that it could have meant televised audience never crossed my mind and surely wouldn't have for some other portion of the audience as well. I was out googling for largest concert size records to make sure I wasn't crazy.


during the set he performed purple rain, probably his most famous song, and it was pouring down rain. that's why the weather is notable.

a hundred million people in one venue is certainly an unbelievable number though.


Maybe because it's not the best time to nitpick about that?


ahah, always keep a scientist mind, right? that's why you downvoted :D keep it up, scientist :)


That's actually spot on if you count the TV viewing audience - 112M people as of the 2014 event.


That presumably includes the television audience.


I would assume they are including the tv audience as well.



