Wikimedia's infrastructure is radically different from that of most FAANG companies.

In large part because roughly 99% of their traffic is read-only. While Facebook and Google have to do heavy work for every click and action taken on their services, Wikimedia can cache basically everything, allowing them to operate on a tiny fraction of the machines (and infrastructure) that the rest of the players need.
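To make the point concrete, here is a minimal cache-aside sketch in Python. The function names, the TTL, and the in-process dict are all invented for illustration; Wikimedia's real setup uses dedicated edge caching layers in front of MediaWiki, but the shape of the win is the same: almost every request is a cache hit, and only the rare edit does real work.

    # Toy cache-aside sketch; names and numbers are illustrative, not Wikimedia's.
    import time

    cache = {}        # in production this would be an edge/CDN cache tier
    CACHE_TTL = 3600  # assumed TTL in seconds

    def render_article(title):
        time.sleep(0.05)  # stand-in for an expensive server-side render
        return "<html>%s rendered at %f</html>" % (title, time.time())

    def get_article(title):
        entry = cache.get(title)
        if entry and time.time() - entry[1] < CACHE_TTL:
            return entry[0]                 # cache hit: no app-server work at all
        html = render_article(title)        # cache miss: render once...
        cache[title] = (html, time.time())  # ...then serve it to every reader
        return html

    def edit_article(title, new_wikitext):
        # writes are ~1% of traffic, so purging on edit is cheap
        cache.pop(title, None)
        # ...persist new_wikitext to the database here...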




They also have looser consistency SLAs. The only hard requirement is that a user can read back their own writes, but it’s okay if other users are served stale data for a few seconds or even minutes. This makes cache invalidation, one of the most notoriously difficult and expensive operations at large scale, much, much easier.


Facebook also has a similar SLA. I've heard that at one point in their architecture (~2010), they literally stored the user's own writes in memcached and then merged them back into the page when rendered. You would see a page consistent with your actions, but if you logged into Facebook as any of your friends your updates might not show up until replication lag passed.


Close, IIRC we cached the fact you had just done a write, and a subsequent read request that arrived on the replica region was then proxied to the primary region instead of serviced locally.


My memory is fuzzy now, but this dates back to when there were only two datacenter regions and one of them held all the primary DBs (2011 or so). All write endpoints were served in that region, so if a user routed to the secondary region did a write, the request was proxied to the primary region. After a write, a cookie was set for the user in question, which caused any future reads to be proxied to the primary region for a few seconds while the DB replication stream (upon which cache invalidation was piggybacked) caught up, because if they went to the secondary region, memcached was now stale.
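A rough sketch of that routing logic as I read the description above; the cookie name, the replication window, and the helper functions are all invented for illustration, not Facebook's real ones:

    import time

    STICKY_COOKIE = "recently_wrote"
    REPLICATION_WINDOW = 10  # seconds; assumed bound on replication lag

    def proxy_to_primary(request):   # placeholder for cross-region proxying
        return {"body": "served by primary region", "cookies": {}}

    def serve_locally(request):      # placeholder for local replica + memcached path
        return {"body": "served by local replica", "cookies": {}}

    def handle(request, is_write):
        if is_write:
            resp = proxy_to_primary(request)
            # remember the write so this user's reads chase the primary for a bit
            resp["cookies"][STICKY_COOKIE] = str(time.time())
            return resp
        wrote_at = request.get("cookies", {}).get(STICKY_COOKIE)
        if wrote_at and time.time() - float(wrote_at) < REPLICATION_WINDOW:
            # local replica (and thus memcached) may still be stale for this user
            return proxy_to_primary(request)
        return serve_locally(request)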

It hasn’t been this way since around 2013, but again I am fuzzy on the details. I think that’s when most such data was switched to TAO, which has local read-your-writes consistency. As long as users landed in the same cluster (and thus the same TAO cluster), what they wrote was visible to them, even if the DB write hadn’t yet replicated to their region.

FlightTracker postdates my time at FB (ended 2018ish) so I’m not sure how that is used. These systems evolved a lot over time as requirements changed.

I don’t remember anything about writes being batched in memcached and merged in on page load.


Dirty bits at scale


I’m guessing this is ECC memory, so likely correcting for bad data.


I think they meant dirty bit as in “a flag that means update needed,” not as in “bit flipped due to glitch.”


Pretty clever. Is that still how it works?


Pretty sure this paper describes what they're doing now: https://research.fb.com/publications/flighttracker-consisten...


I'm not sure if FlightTracker completely replaced the need for the internal consistency inside TAO. You can read about that here: https://www.usenix.org/system/files/conference/atc13/atc13-b...


Interesting that this sounds very similar to how multiplayer games do it.


This is indeed correct. Wikimedia overall uses fewer than 2000 bare-metal servers, so yes, the infrastructure is tiny compared to theirs.

What can be interesting, I think, is that it is a completely open infrastructure that has to solve problems at global traffic scale.

If people are interested in knowing more, I suggest you also take a peek at the Wikimedia tech blog, specifically the SRE category https://techblog.wikimedia.org/category/site-reliability-eng... and the performance one https://techblog.wikimedia.org/category/performance/


Search is also largely read-only. The advantage Wikipedia has is that its traffic overwhelmingly goes to the head of the page distribution, so simple caching solutions work very well. Google has a pretty extreme long-tail distribution (~15% of daily queries have never been seen before), and so needs to do a lot of computation per query.
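A toy way to see the difference (all parameters invented, nothing to do with real Google or Wikipedia traffic): sample requests from a more or less skewed popularity curve and measure how often a small cache absorbs them.

    # Cache hit rate under a head-heavy vs. a flatter popularity distribution.
    import random

    def hit_rate(skew, n_items=100_000, n_requests=300_000, cache_size=1_000):
        weights = [1.0 / (rank ** skew) for rank in range(1, n_items + 1)]
        reqs = random.choices(range(n_items), weights=weights, k=n_requests)
        cached, hits = set(), 0
        for item in reqs:
            if item in cached:
                hits += 1
            elif len(cached) < cache_size:
                cached.add(item)  # naive fill-once cache, no eviction
        return hits / n_requests

    print("head-heavy (skew=1.2):", hit_rate(1.2))  # most requests hit the cache
    print("long tail  (skew=0.6):", hit_rate(0.6))  # far more unique work per query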


> Google has a pretty extreme long-tail distribution (~15% of daily queries have never been seen before)

Do you have a source for this?

I'd be willing to bet that the ONLY reason why 15% of their daily queries "haven't been seen before" is because they add unneeded complexity like fingerprinting. You're making it seem like they've never seen a query for "cute animals" before when obviously they have. They choose to do a lot of extra legwork because of who you are.

So your claim that 15% of their queries have "never been seen before" is probably inaccurate. I'd be willing to bet that "15% of their queries are unique because of the user, location, or other external factor separate from the query itself."

They've seen your query before. They've just never seen you make this query from this device on this side of town before.


If you took into account user, location, etc. 15% seems too low. I almost never search for the exact same thing twice in the same location.

15% of the queries themselves are unique. https://blog.google/products/search/our-latest-quality-impro...

https://www.google.com/search/howsearchworks/responses/

I work for Google (and used to work on Search).


I'd be interested in seeing how polluted that 15% of new queries is with people blasting malformed URLs or FQDNs into the omnibox of Chrome.


What's so unbelievable about 15%? I personally think it is way lower than I expected. We're clearly not googling in the same way.


I agree with you. Also, in my experience, less tech-savvy people tend to overcomplicate their queries instead of just entering the relevant keywords, which I'm sure accounts for many uniques.


The point is not that. It’s that when you search for “cute animals”, Google shouldn’t be storing that you searched for it, or even care. Your location is arguably relevant, but it could be made coarse enough (except when searching for directions) to still allow at least some caching.


Hey Igor! Hate to be a bore, but I wanted to provide feedback that your comment may unintentionally come across as aggressive. OP has pretty relevant work experience that I know I’d love to hear more about, but there’s not really any room for them to respond.

I know many folks IRL who work at big tech who have no interest in posting here because the community comes off as very unwelcoming. That’s a shame, because they have insight that would be great to hear. Regardless of anyone’s opinion of their employer.

Apologies in advance if your intent was purely about the topic. I just thought I read something in your tone that might hinder discourse rather than encourage it. I wanted to point it out, in case it was unintentional.


Agreed about the tone. The comment could have been less argumentative — instead of "that's not the point," they could have said "that's not the only reason."

On the other hand, if I'm not responding, it's not because I find HN too abrasive — it's because I am afraid of leaking non-public information. That's why whenever I talk about Google, I try to cite a Google blog post or other authoritative source, or talk about my own personal experience; hence, "I rarely search for the same query twice."


I’m gonna have to disagree with the negative comments above concerning Igor's tone. He made his point with clear, respectful language that I would be happy to entertain at work, at the bar, at worship, or while on a (previous to covid) group run or golf outing. So, to me, it looks like instead of an ‘agree to disagree’ while respecting each other, you disrespect Igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is. Therefore, in my judgement, you guys are being unfair to Igor while also being disingenuous about your reason for policing his tone.


> disrespect igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is

I didn't dismiss his argument; I said that he was correct right after he posted: https://news.ycombinator.com/item?id=26073488

"That's not the point" can be interpreted as respectful, but it also can be interpreted as argumentative. I chose to assume good intentions, but I offered a different phrasing that would have a higher chance of not being misinterpreted: i.e. using "yes and" instead of "no but": https://www.theheretic.org/2017/yes-and-vs-no-but/


I apologize for the tone. The start of my comment was clumsily worded and it wasn’t my intention to have it come off as argumentative. The way I read the GP comment to mine, it was talking about how Google’s tracking of its users’ telemetry was what was contributing to the uniqueness of requests. Your comment, to me, boiled down to the fact that of course most requests are unique because of tracked location data and the user account. There seemed to be a disconnect: your comment took for granted that user location and account were part of the search query, while the person you were replying to specifically challenged that notion (again, in my reading of both). I tried to post a concise bridge between the two concepts, and of course we all see how well I did with that :)

Having said that, I do think this is clearly a sensitive issue, not a purely technical one. I can appreciate the nuance of working for Google and doing excellent work while seeing the company criticized left and right for its business model. I think, given the community, while there is opposition to how Google may at certain points conduct itself as a corporation, there is no lack of respect for any individual working there. I certainly view my comment and the discussion of privacy as having 50% to do with Google’s strategy and 50% to do with the technical aspects of whether you can build a search engine that holds user privacy as a core priority, rather than as launching an ad hominem attack on you or anyone. And I saw your other comment that agreed with me and the GP comment, so I think, my first sentence aside, we are on the same page :)


Thanks for responding constructively, Igor.


To me, Igor's comment is also misplaced. He injects activism into a technical discussion (which sadly happens very often here on HN). We all know by now that the big corps are to a large degree based on data collection. We do not need to be reminded of it each and every day. We are adults; if we don't like it, we use alternatives.


Yeah, this is a fair point. My larger point was mostly that HN misses out on some valuable comments by insiders because those people are disincentivized by some of the rhetoric and tone when an article on big tech is popular. I didn’t think the comment I replied to was particularly aggressive - it was just something that came to mind when I read it. OP was actually very kind and constructive in their response - a good ending and a constructive discussion for us all!


This is right on the money — getting search results for queries that are personalized to e.g. location means that you can't cache those search results (or if you did cache them, their entries would be useless).


Right, you can cache that query. That doesn't mean that you can cache "two bunnies playing in the snow r/aww reddit".


I think you mean "15% seems too high". An easy way to think about this is the following: even if you search the entire internet you will almost never see the same sentence twice, assuming it has a certain number of words. There is a combinatorial explosion in possible sentences to write. Search queries are essentially just sentences without stopwords.


Removing stop words is what old school users of IT systems do, because that's what we learned worked best at the time.

Internet users who came online later, from GenZ to many boomers, will often just write conversational sentences and questions.


I don't understand how you compute that estimate.

I doubt you store the history of all searches ever? People don't need a google account to query the engine, others disable history, etc.

Are you saying you still have every search ever made? Because you would need that to say a query hasn't been made before, wouldn't you?


Why would you not store every search ever? It's only a few petabytes, and you can find out all sorts of useful info from it.
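Back-of-the-envelope check on "a few petabytes" (every figure below is an assumption, not a Google number):

    searches_per_year = 2e12   # "trillions of searches every year"
    bytes_per_query = 60       # assumed average query text plus a little metadata
    years = 20
    print(searches_per_year * bytes_per_query * years / 1e15, "PB")  # ~2.4 PB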


I don't know how they did it but I suspect that it wouldn't be very hard to model the distribution by sampling a few million queries and extrapolate from that.


You'd only need to store the list of unique searches, but even if that's true and the 15% number is true, that must be a huge amount of data.


https://blog.google/products/search/our-latest-quality-impro...

"There are trillions of searches on Google every year. In fact, 15 percent of searches we see every day are new"


It would still be helpful to know what ‘new’ means.

Does it mean literally the text string typed into the box by the user is new?

Or does it mean the text string combined with a bunch of other inferred parameters we don’t know about is new?


At the lower bound, that's 150 billion "new" searches per year. There are approximately 50,000 unique English words, not including names or misspellings. If Google searches were on average 3 words long, it would take 833 years at that rate to go through all the combinations.

Alternatively, if we assume that Google has already recorded 20 trillion unique search queries (~1 trillion new ones per year for 20 years), the odds that a query composed of 3 correctly spelled English words that are not names has been seen before are at most around 16%. Even if we restrict queries to those using the most common 1000 words, there's roughly a 50/50 chance of a query composed of 4 words being unique.

Of course people do not just type random words into the search bar, and some terms will be searched many thousands or even millions of times, but still, if anything, the fact that 85% of searches aren't unique is the surprising part.
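For what it's worth, the arithmetic above roughly checks out under its own assumptions (these are the parent comment's numbers, not real search data):

    new_per_year = 150e9     # 15% of ~1 trillion searches per year
    vocab = 50_000           # assumed English words, excluding names/misspellings
    three_word = vocab ** 3  # ~1.25e14 possible 3-word queries
    print(three_word / new_per_year)  # ~833 years to exhaust them at that rate

    recorded = 20e12                  # assumed 20 trillion unique queries ever seen
    print(recorded / three_word)      # ~0.16, i.e. at most ~16% chance of a repeat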


> Does it mean literally the text string typed into the box by the user is new?

I guess that could be the case. Many could be related to things that are in the news. Like 'the cw powerpuff girls' for the new show that was announced. No one was searching for that until the announcement, probably.


Right - but without clarification from Google, we really don’t know what it means.


New for the day or new for the history of the engine?


I think the GP post has a point. I've noticed people use Google really differently from how I do. E.g. I would search for "figure concave" while my brother would search for a longer phrase.

Also, speaking of people's behaviour, it would not make sense to search every day for "cute animals", but the volume of searches for new things people discover as they get older makes more sense. I mean, just look at search trends for things like "hydroxychloroquine", for example (and that's not to mention people who get it wrong, i.e. yet other factors that make search queries differ).

Also, other languages can change the queries depending on how you phrase the sentence. Add to that the people using other ways to search instead of just visiting google.com, and I think you can get pretty close to 10%.

If fingerprinting were the reason, I surmise 15% would be too low a figure. Were that the case, I think it would be more like 20-25% of searches rather than 15%.

It could very well be that they classify fingerprinted searches differently in some countries and not others? That might explain the 15% figure.

I might be wrong and have underestimated Google's fingerprinting techniques. If they really have good fingerprinting, that would bring the estimate I have in mind down to a better number (closer to 15%, maybe?).


So consider your hydroxychloroquine example again this way:

Nobody has ever searched for hydroxychloroquine before today. Today is the day the word is hypothetically invented. Today 2 million people will search for hydroxychloroquine. But only one of them was the first to do it.

What I know about pop-culture and viral internet culture is telling me that 15% of 1 trillion searches being unique is shady math.

So I am not fully convinced that the 15% claim is completely transparent.


It's a guess, but my thinking is that previously most of the people who searched for the term hydroxychloroquine were scientists and others in related fields, not your general population. Suddenly covid happens and now large numbers of people learn about this new drug they'd never heard of before. They are gonna search, and I presume this, for mostly wildly different things like: "how does it work?", "does it cause some disease?", "insert something political here about hydroxychloroquine", "did aliens make hydroxychloroquine?" and many more things I lack the imagination to come up with, and that's only about hydroxychloroquine. I doubt the 15% number is about single-word cases; it's more about combinations of words, and that seems reasonable. Inventing new words daily seems unlikely; chaining them, on the other hand, seems plausible.


The vast majority of people don't search for [hydroxychloroquine]. They search for [Is hydroxychloroquine effective in treating COVID-19?] or [What is the first drug that was approved to treat COVID-19?] or [What methods do we currently have to treat COVID-19?]. You can see these on the search results page as the "Common questions related to..." widget. How else do you think Google gets that data?

The folks who use keyword-based searches are largely those who got on the Internet before ~2007. Tech-savvy, relatively well-off, usually Millennial or Gen-X, plugged into trends. This happens to be the dominant demographic at Hacker News. But there's a much larger demographic who just types in whatever they're thinking of, in natural language, and expects to get answers.

Come to think of it, this is also the demographic that doesn't use tabbed browsing, and uses whichever browser ships with their OEM, and often doesn't realize that there's a separate program called a "browser" running when they click on the "Internet", and issues a Google Search for [google] (#3 query in 2010) when they want to get to Google even though they're on Google already but don't realize it, and doesn't know what a URL is. When a big-tech company makes a brain-dead usability decision you don't like, first consider how that usability choice might appear to your grandmother and it might not seem so brain-dead.


> So your claim that 15% of their queries have "never been seen before" is probably inaccurate.

I'm not sure, on my productive days maybe >50% of my Google searches are not very cachable. (for example, I just googled "htop namespace", "htop novel bytes", "htop pss", "htop nightly build ubuntu 14.04")


https://blog.google/products/search/our-latest-quality-impro...

They briefly mention the statistic in the last paragraph.


It's somewhat analogous to the claim that almost every spoken sentence had never been spoken before in the history of language.


Well, you'd be wrong


Meh, it happens.


Wikimedia also has less incentive/drive to meticulously track every interaction on their pages. The level of tracking present on Facebook and Google has to be extremely computationally intensive.


I agree. Another (no contradiction) way of looking at this is that Wikimedia infrastructure is radically different because Wikimedia is radically different.

They need it to be a certain way in order to operate. The limitations and advantages of how software gets made. Why it gets made. The way the software works. How and why product decisions were made over the last 2 decades. What resources they have/had available. It's all a totally different game. Not surprising that different soil and a different climate grow different plants.

One of Google's early coups, when they were a strategic step ahead of the boomers, was bankrolling Gmail, YouTube and such. Gmail offered free giant inboxes. They got all the customers. This cost billions (or maybe just hundreds of millions), but storage costs go down every year while the value of ads/data/lock-in and such goes up every year. Similar logic for YouTube: (1) buy a leading video-sharing site; (2) bankroll HD streaming because you have the deepest pockets; (3) own online free TV entirely.

That's who Google is, good or bad. How funding works. What products get built. What infrastructure is necessary, possible, affordable. All interlinked. Wikipedia & Google were founded at the same time. Within 5 years (circa 2006) Google was buying charters and fiefdoms. Wikimedia, meanwhile, was starting to take flak for raising 3 or 4 million in donations.

It's kinda crazy that Wikipedia is comparable in scale to FAANGs when you consider these disparities.


You can do quite a bit of processing per page load without issue. Facebook and Google just take it rather past that point into near absurdity, while still being highly profitable.


To be fair, there's a bit of a combinatoric effect of scale * features going on there. I'm sure you could build most of a Facebook equiv. 100x-1000x cheaper if it only served one city instead of the whole planet.


The effects of scale are less combinatoric than you might think. Most people on my Facebook feed are from the same city anyway, even though Facebook is global.


The effects and scale of sales (ads) are very combinatoric, though.


Yeah why do they keep spending billions to build new datacenters when they could just stop being absurd instead?

The contempt on here is crazy sometimes.


The idea of marginal value/marginal cost is that companies will generally continue spending one billion dollars to add size and complexity, as long as they get back a bit more than a billion dollars in revenue.

So it wouldn't necessarily be contradictory if most of their core functionality could be replicated very simply, yet the actual product is immensely complicated. I forget where I first read this point, but probably on HN.


Or maybe you're just reading too much into "absurd", which can just be a colorful word for "an extremely huge amount".


I don't think that Facebook/Google developers are foolish or incompetent. That would be contempt. Instead, I think that Facebook and Google as conglomerate entities are fundamentally opposed to my right to privacy. That they make decisions to rationally follow their self-interest does not excuse the absurd lengths to which they go to stalk the general population's activities.


> I don't think that Facebook/Google developers are foolish or incompetent.

Nobody in this thread is saying that. Parent to you said:

> they could just stop being absurd instead [of building more DCs]

implying FB could build fewer DCs by scaling down some of their per-page complexity/"absurdity". Basically, it's saying their needs are artificial or borne of requirements that aren't really requirements.

> conglomerate entities are fundamentally opposed to my right to privacy

That's a common view, but it's not on topic to this thread. This thread is mostly about the tech itself and how WikiMedia scales versus how the bigger techs scale. It has an interesting diversion into some of the reasons why their scaling needs are different.

You could instead continue the thread stating that they could save a lot of money and complexity while also tearing down some of their reputation for being slow and privacy-hostile by removing some of the very features these DCs support (perhaps) without ruining the net bottom line.

This continues the thread and allows the conversation to continue to what the ROI actually is on the sort of complexity that benefits the company but not the user.


I was the one saying absurdity, and I think you’re missing the context. Work out how much processing power even just another 1 cent per thousand page loads justifies, and perfectly rational behavior starts to look crazy to the little guys.

Let’s suppose the Facebook cluster spends the equivalent of 1 full second of 1 full CPU core per request. That’s a lot of processing power, and for most small-scale architectures it would likely add wildly unacceptable latency per page load. Further, as small-scale traffic is very spiky, even low-traffic sites would be expensive to host, making it a ludicrous amount of processing power.

However, Google has enough traffic to smooth things out, it’s splitting that work across multiple computers, much of it happens after the request so latency isn’t an issue, and it isn’t paying retail, so the processing power costs little more than hardware and electricity. Estimate the rough order of magnitude they’re paying for 1 second of 1 core per request and it’s cheap enough to be a rounding error.
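Order-of-magnitude version of that argument; the per-core-hour figure is a guess at internal hyperscaler cost, not anything published:

    core_hour_cost = 0.01                    # assumed USD per CPU core-hour at cost
    core_second_cost = core_hour_cost / 3600
    cost_per_request = 1 * core_second_cost  # 1 core-second per request
    print(cost_per_request * 1000)           # ~$0.003 per thousand page loads

So if that extra second of compute earns back even a fraction of a cent per thousand page loads, it pays for itself, however absurd it looks from a small-site perspective.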


Every request at FB is handled in a new container. This isn’t absurd, it’s actually pretty neat :)

Edit: I don’t know what I’m talking about. Happy Monday!


What? Are you calling the context of an HHVM request a container just to confuse people?

Also, there's way more than just the web tier out there.


Wasn’t my intention to confuse, just repeating something I’ve been told by FB folks.

Everyone, please listen to Rachel and never ever me.


Wow that sounds interesting, does anyone know if this is true?


I'm not on the team that handles this, but I highly doubt that this is the case.


That's not neat... that's freakish.


Does it hurt their caching if you browse Wikipedia when signed in?

I recall reading HackerNews used to have that problem, unsure if it still does.


Looking at the source of a Wikipedia page, my username appears 6 times, so I guess it must reduce caching a bit. Though I guess they could cache the user-info bits and the rest of the page separately and just splice them together.
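Something like this, purely as a hypothetical sketch (MediaWiki's actual mechanism may differ): cache one user-agnostic copy of the page with a placeholder and substitute the per-user bits on each request.

    # Cache the anonymous HTML once, splice the username in per request.
    page_cache = {}

    def render_for_anyone(title):
        # a placeholder token instead of a username keeps the HTML shareable
        return "<html><div id='user-box'>{username}</div>Article: %s</html>" % title

    def get_page(title):
        if title not in page_cache:
            page_cache[title] = render_for_anyone(title)  # expensive, done once
        return page_cache[title]

    def serve(title, username):
        # cheap per-request step: inject the personal bits into the shared copy
        return get_page(title).format(username=username)

    print(serve("Caching", "ExampleUser"))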



