20% of requests for Wikimedia Commons are for one image of a flower

jacquesm · on Feb 9, 2021

At the height of the browser wars I once woke up to Microsoft hotlinking a small button for downloading our software from the MSN homepage. I tried to reach someone there for hours but nobody cared enough to do something about it. The image was small (no more than a few K), but the millions of requests that page got were enough to totally kill our server.

Finally, I replaced the image on there with a 'Netscape Now' button. Within 15 minutes the matter was resolved.

aembleton · on Feb 9, 2021

Back around 2002, I had a pdf icon on my website. It got deep linked by a few others but the number one source of traffic came from the website of a lawyer who specialised in intellectual property. There was something on there about how it was illegal to deep link.

I was tempted to replace with goatse but I think I just changed it to a screenshot of his website saying that it was illegal to deep link.

It soon got changed.

jacquesm · on Feb 9, 2021

That's a neat example of recursion :)

lobf · on Feb 10, 2021

I honestly don't understand the issue- he linked to a specific page? What's the problem?

cwmartin · on Feb 10, 2021

I believe they mean the lawyer hotlinked [1] their image so that every visitor to the lawyer’s page would result in an image download from their server.

[1]: https://en.wikipedia.org/wiki/Inline_linking

aembleton · on Feb 10, 2021

He deeplinked to an image file, whilst claiming on his website that such action is illegal.

renerthr · on Feb 11, 2021

Is deeplinking the same as hotlinking?

gumby · on Feb 9, 2021

Even though it's not illegal!

rdescartes · on Feb 9, 2021

One of my friends used the same strategy to block DDOS from China : just put "Falun Gong" on there and it was resolved instantly.

himinlomax · on Feb 9, 2021

I remember someone doing that with the goatse picture. The hotlinker was pissed and all sorts of amusing drama ensued.

daveslash · on Feb 9, 2021

That was exactly how I learned what goatse was. My MySpace page was all decked out with images that I was hotlinking from some server... The server owner realized this and replaced all the images with Goatse. One day a friend goes "Hey... uh, what's up with your MySpace page... that's pretty gross". So I went to log in: Goastse. Goastse everywhere (gestures with hand). And my eyes were never the same again ಠ_ಠ

Edit: grammar.

MisterTea · on Feb 9, 2021

That was popular in the early ebay days when you had to host your own images. A friend had someone selling similar items using his image links. So he changed the images to goatse. Problem solved.

gipp · on Feb 9, 2021

The Tribalwar forums did this to CNN after 9/11, CNN had hotlinked one of those images where people were trying to pick out "demon faces" in the smoke

CogitoCogito · on Feb 10, 2021

It wasn't only CNN. A bunch of big news sites linked directly to the image hosted at tribalwar. It all started with some news video of one of the WTC towers smoking. Someone on the forum screenshoted the video and asked "what is this?" because the smoke produced this weird devil-like formation. That picture goes spread around and soon news sites started writing stories saying that triablwar had photoshopped the image and that they were evil and making fun of a tragedy blah blah blah. So basically the news sites were DDOSing tribalwar and lying about them to make them look bad in their sensationalist articles. The administrators of the forums send many emails begging them to stop directly linking to the site and it only got worse and worse. Finally they replaced the image with goatse (with text overplayed giving the true story). If I remember correctly the image was viewed by hundreds of thousands or maybe more people before they were totally removed. That was how tribalwar goatsed the internet. It really was quite legendary.

panpanna · on Feb 10, 2021

According to the article IPs downloading this image come mostly from India.

So replace it with the pakistani flag to solve the problem (or start WW3)

thaumasiotes · on Feb 9, 2021

> One of my friends used the same strategy to block DDOS from China : just put "Falun Gong" on there and it was resolved instantly.

...because attacks from China are horrified at the thought of disrupting Falun Gong?

dspillett · on Feb 9, 2021

Because it is one of the things that will get you added to the blocklists that form part of the Great Firewall of China.

It won't stop a hacker who is probably bypassing parts of that anyway, but the more casual requests such as those caused by deep linking will generally stop getting through.

failrate · on Feb 9, 2021

We used something like this technique back in the Flash days. Sites would straight up steal your games, so one defense was to have the game grab its sprites from a server local endpoint. Thieving sites would get either no graphics or deliberately corrupted graphics.

snoshy · on Feb 9, 2021

The old school response, weaponized without being inappropriate.

rattray · on Feb 9, 2021

That's hilarious!

Did they continue to link to your software after that? (I'm curious - what was your software?)

jacquesm · on Feb 9, 2021

Yes, they did, they actually thought it was quite funny. They even cached the actual download once they realized we wouldn't be able to deal with that either. The software was the first version of the public peer-to-peer webcam software I wrote:

http://web.archive.org/web/20000510010712/http://www.camarad...

phinnaeus · on Feb 9, 2021

Oh my! This is a blast from the past. I was a kid, probably 10 years old or something, and I had a LEGO MovieMaker webcam. I was trying to set it up as a sort of security/monitoring camera for the back door of the small business my parents ran. I remember using this software and supposedly getting it working.

I invited my parents to come see what I had done, and somehow typed the website wrong and ended up on a spanish-language porn site. I could not hit the back button fast enough. Possibly one of the most embarrassing memories of my childhood.

I have no idea what my parents thought I was up to.

jacquesm · on Feb 9, 2021

Heh. Hilarious story, thank you! Camarades.com had just about everything, from people being born to people dying and everything in between. It was a pretty honest (sometimes brutally honest) slice of life.

One of the most popular cams for years was an old person that was extremely ill and that rarely moved but he had pretty big fanclub and he thought it was quite funny that he was more famous on what eventually became his deathbed than he had ever been while he was still active. After he died his family asked to remove all the images and close the account which of course we did. Makes you wonder if all those people wishing him well over the years kept him going a bit longer. What is interesting is that if you did this today I'm pretty sure the jerks would drown out the nice people by a considerable margin, of course there were jerks back then as well, but on the whole the internet seemed to be a much nicer place to hang out than it is today.

prawn · on Feb 9, 2021

Not sure if you're aware, but it's interesting that you mention Lego as the person you're responding to once accidentally bought literally tons of bulk Lego and later designed an automated Lego sorting machine. It's a fun read:

https://jacquesmattheij.com/sorting-two-metric-tons-of-lego/

alickz · on Feb 9, 2021

Haha I know that pain.

When I was a kid I asked my mom to print me out Grand Theft Auto cheats from Gamewinners.com while she was in work.

Somehow I got the address wrong and she wanted to know why I wanted to print out pages and pages from a site dedicated to men cheating on their wives. Got there in the end though and I still have some of those GTA cheats memorised.

acct776 · on Feb 9, 2021

Your mom might have another family in the Greater Toronto Area now, just so you are aware!

macintux · on Feb 9, 2021

My then-wife was watching over my shoulder once as I typed something into the address bar. “Freshmeat.net” auto-completed, drawing a suspicious look from her.

rattray · on Feb 9, 2021

Beautiful. The internet was a truly different place back then...

jacquesm · on Feb 9, 2021

With 100K visitors / day or so we were in the top 30 websites world wide in 1998. The really big boosts came from the Space Shuttle webcasts and an Yves St. Laurent fashion show webcast from Paris.

Hard to believe now, a typical blog post will already pick up 30K visitors without too much trouble.

NetOpWibby · on Feb 9, 2021

I could listen to stories of the Old Net all day.

jacquesm · on Feb 9, 2021

Enjoy:

https://jacquesmattheij.com/story-behind-wwcom-camaradescom/

And apologies for the non-working images.

NetOpWibby · on Feb 11, 2021

Sweet, looking forward to reading.

gowld · on Feb 9, 2021

Serves you right for hotlinking ;-)

jacquesm · on Feb 9, 2021

Yes, but at least it was my own domain :)

I didn't see that consequence coming when camarades.com shut down. I really should dig up those images and repair the blog but the todo list isn't really getting any shorter on this end.

xrisk · on Feb 9, 2021

How is this on the Wayback machine?!

loktarogar · on Feb 9, 2021

You can click "about this capture" for more information

> Starting in 1996, Alexa Internet has been donating their crawl data to the Internet Archive. Flowing in every day, these data are added to the Wayback Machine after an embargo period.

Thorrez · on Feb 9, 2021

Fun fact: Amazon's home assistant was named after Alexa Internet. Amazon owns Alexa Internet.

phpnode · on Feb 9, 2021

it's not named after it, it's just amazon is so massive they have to reuse brand names. AWS has exhausted not only the supply of IPv4 addresses but also the supply of 3 letter initialisms.

renerthr · on Feb 11, 2021

What's an embargo period?

jacquesm · on Feb 14, 2021

A period during which someone who knows something or has something does not release it to the public.

DonHopkins · on Feb 9, 2021

As pioneer of "<something> On Internet", do you regret not turning out like Russ Hanneman? ;) (OR DID YOU???!)

https://www.youtube.com/watch?v=BzAdXyPYKQo&ab_channel=yate5...

https://silicon-valley.fandom.com/wiki/Russ_Hanneman

I'm just glad I didn't turn out like Erlich Bachman! (OR DID I???!)

https://www.reddit.com/r/SiliconValleyHBO/comments/4jmlv9/wh...

https://silicon-valley.fandom.com/wiki/Erlich_Bachman

vidarh · on Feb 9, 2021

I've finally gotten around to watching this series, and it's disturbing how many moments I've watched that were more familiar than they should have been, and too many characters I could instantly put a real name to....

emag · on Feb 10, 2021

Around...2005? 2006? I discovered someone had deep-linked to an image on work's webserver, where I was admin (being one of the few who knew Linux).

Instead of just outright replacing the image, I set up rules in Apache to check the referer, and if it was our site, serve the correct image. Anyone else, it served up something...questionable.

Problem solved.

Lutzb · on Feb 11, 2021

In the early 2000s a friend of mine was an active seller on ebay. He shot his own product pictures and professionally designed the description pages. Soon enough his content was stolen by competing shops, hotlinking the images from his server.

Ebay didn't care. So obviously the only option was to create a script that would randomly change the images to something unpleasant (think early 2000s rotten internet content).

Good times.

IfOnlyYouKnew · on Feb 9, 2021

Wikimedia is unique in running some of the most popular websites with open access to almost all systems. As someone who has never been on the inside of FAANG, I found it rather interesting to browse around the backend infrastructure.

See, for example, their statistics at https://grafana.wikimedia.org/d/000000102/production-logging...

ShakataGaNai · on Feb 9, 2021

Wikimedia's infrastructure is radically different than most FAANG.

In large part because 99% (+/-) of their traffic is read only. While Facebook and Google have to do heavy workloads for every click and action taken on their services, Wikimedia can cache basically everything. Allowing them to operate on a tiny fraction of the number of machines (and infrastructure) that the rest of the players do.

ashtonkem · on Feb 9, 2021

They also have looser latency SLAs. The only hard requirement is that a user can read back their own writes, but it’s okay if other users are served stale data for a few seconds or minutes even. This makes cache invalidation, one of the most notoriously difficult and expensive operations at large scale, much much easier.

nostrademons · on Feb 9, 2021

Facebook also has a similar SLA. I've heard that at one point in their architecture (~2010), they literally stored the user's own writes in memcached and then merged them back into the page when rendered. You would see a page consistent with your actions, but if you logged into Facebook as any of your friends your updates might not show up until replication lag passed.

mackman · on Feb 9, 2021

Close, IIRC we cached the fact you had just done a write, and a subsequent read request that arrived on the replica region was then proxied to the primary region instead of serviced locally.

cranekam · on Feb 9, 2021

My memory is fuzzy now but this dates back to when there were only two datacenter regions and one of them held all the primary DBs (2011 or so). All write endpoints were served in that region, so if a user routed to the secondary region did a write the request was proxied to the primary region. After doing a write a cookie was set for the user in question which caused any future reads to be proxied to the primary region for a few seconds while the DB replication stream (upon which cache invalidation was piggybacked) caught up, because if they went to the secondary region memcached was now stale.

It hasn’t been this way since around 2013 but again I am fuzzy on how. I think that’s when most such data was switched to TAO, which has local read what you wrote consistency. As long as users landed in the same cluster (and thus TAO cluster) what they wrote was visible to them, even if the DB write hadn’t yet replicated to their region.

FlightTracker postdates my time at FB (ended 2018ish) so I’m not sure how that is used. These systems evolved a lot over time as requirements changed.

I don’t remember anything about writes being batched in memcached and merged in on page load.

eismcc · on Feb 9, 2021

Dirty bits at scale

gogopuppygogo · on Feb 9, 2021

I’m guessing this is ecc memory so likely correcting for bad data.

Hackbraten · on Feb 9, 2021

I think they meant dirty bit as in “a flag that means update needed,” not as in “bit flipped due to glitch.”

IgorPartola · on Feb 9, 2021

Pretty clever. Is that still how it works?

glittershark · on Feb 9, 2021

Pretty sure this paper describes what they're doing now: https://research.fb.com/publications/flighttracker-consisten...

mackman · on Feb 9, 2021

I'm not sure if FlightTracker completely replaced the need for the internal consistency inside Tao. You can read about that here: https://www.usenix.org/system/files/conference/atc13/atc13-b...

dash2 · on Feb 9, 2021

Interesting that this sounds very similar to how multiplayer games do it.

_joe · on Feb 9, 2021

This is indeed correct. Wikimedia overall uses less than 2000 bare-metal servers, so yes the infrastructure is tiny compared to those.

What can be interesting, I think, is that you have a completely open infrastructure that has to solve problems on a global traffic scale.

If people are interested in knowing more, I suggest you also take a peek at the wikimedia techblog, specifically to the SRE category https://techblog.wikimedia.org/category/site-reliability-eng... and the performance one https://techblog.wikimedia.org/category/performance/

nostrademons · on Feb 9, 2021

Search is also largely read-only. The advantage Wikipedia has is that its traffic overwhelming goes to the head of the page distribution, so simple caching solutions work very well. Google has a pretty extreme long-tail distribution (~15% of daily queries have never been seen before), and so needs to do a lot of computation per query.

zelon88 · on Feb 9, 2021

> Google has a pretty extreme long-tail distribution (~15% of daily queries have never been seen before)

Do you have a source for this?

I'd be willing to bet that the ONLY reason why 15% of their daily queries "haven't been seen before" is because they add un-needed complexity like fingerprinting. You're making it seem like they've never seen a query for "cute animals" before when obviously they have. They choose to do a lot of extra leg work because of who you are.

So your claim that 15% of their queries have "never been seen before" is probably inaccurate. I'd be willing to bet that "15% of their queries are unique because of the user, location, or other external factor separate from the query itself."

They've seen your query before. They've just never seen you make this query from this device on this side of town before.

tylerhou · on Feb 9, 2021

If you took into account user, location, etc. 15% seems too low. I almost never search for the exact same thing twice in the same location.

15% of the queries themselves are unique. https://blog.google/products/search/our-latest-quality-impro...

https://www.google.com/search/howsearchworks/responses/

I work for Google (and used to work on Search).

zelon88 · on Feb 9, 2021

I'd be interested in seeing how polluted that 15% of new queries is with people blasting malformed URLs or FQDNs into the omnibox of Chrome.

Kiro · on Feb 9, 2021

What's so unbelievable about 15%? I personally think it is way lower than I expected. We're clearly not googling in the same way.

simias · on Feb 9, 2021

I agree with you. Also in my experience less tech-savvy people tend to overcomplicate their queries instead of just entering the relevant keywords which I'm sure accounts for many uniques.

IgorPartola · on Feb 9, 2021

The point is not that. It’s that when you search for “cute animals”, Google shouldn’t be storing that you searched for that, or even care. Your location is arguably potentially relevant but it could be coarse enough except when searching for directions to allow at least some caching.

edmundsauto · on Feb 9, 2021

Hey Igor! Hate to be a bore, but I wanted to provide feedback that your comment may unintentionally come across as aggressive. OP has pretty relevant work experience that I know I’d love to hear more about, but there’s not really any room for them to respond.

I know many folks IRL who work at big tech who have no interest in posting here because the community comes off as very unwelcoming. That’s a shame, because they have insight that would be great to hear. Regardless of anyone’s opinion of their employer.

Apologies in advance if your intent was purely about the topic. I just thought I read something in your tone that might hinder discourse rather than encourage it. I wanted to point it out, in case it was unintentional.

tylerhou · on Feb 9, 2021

Agreed about the tone. The comment could have been less argumentative — instead of "that's not the point," they could have said "that's not the only reason."

On the other hand, if I'm not responding, it's not because I find HN too abrasive — it's because I am afraid of leaking non-public information. That's why whenever I talk about Google, I try to cite a Google blog post or other authoritative source, or talk about my own personal experience; hence, "I rarely search for the same query twice."

rStar · on Feb 9, 2021

I’m gonna have to disagree with the negative comments above concerning Igors tone. He made his point with clear respectful language that I would be happy to entertain at work, at the bar, at worship or while on a (previous to covid) group run or golf outing. so, to me, it looks like instead of an ‘agree to disagree’ while respecting each other, you disrespect igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is. Therefore, in my judgement, you guys are being unfair to Igor while also being disingenuous about your reason for policing his tone.

tylerhou · on Feb 9, 2021

> disrespect igor by dismissing his arguments due to his tone, which handily allows you to ignore his content, such as it is

I didn't dismiss his argument; I said that he was correct right after he posted: https://news.ycombinator.com/item?id=26073488

"That's not the point" can be interpreted as respectful, but it also can be interpreted as argumentative. I chose to assume good intentions, but I offered a different phrasing that would have a higher chance of not being misinterpreted: i.e. using "yes and" instead of "no but": https://www.theheretic.org/2017/yes-and-vs-no-but/

IgorPartola · on Feb 9, 2021

I apologize for the tone. It the start of my comment was clumsily wired and it wasn’t my intention to have it come off as argumentative. The way I read the GP comment to mine was talking about how Google’s tracking of its users’ telemetry was what was contributing to the uniqueness of requests. Your comment to me boiled down to the fact that of course most requests are unique because of tracking location data and the user account. There seemed to be a disconnect because your comment took for granted that user location and account were a part of the search query while the person you were replying to specifically challenged that notion (again in my reading of both). I tried to post a concise bridge between the two concepts, and of course we all see how well I did with that :)

Having said that, I do think this is clearly a sensitive issues, not a purely technical one. I can appreciate the nuance of working for Google and doing excellent work while seeing the company criticized left and right for its business model. I think given the community, while there is opposition to how Google may at certain points conduct itself as a corporation, there is no lack of respect for any individual working there. I certainly view my comment and the discussion of privacy as having 50% to do with Google’s strategy and 50% to do with the technical aspects of whether you can build a search engine that holds user privacy as a core priority rather than trying to launch an ad hominem on you or anyone. And I saw your other comment that agreed with me and the GP comment so I think my first sentence aside, we are on the same page :)

rectang · on Feb 9, 2021

Thanks for responding constructively, Igor.

WanderPanda · on Feb 9, 2021

To me Igors comment is also displaced. He injects activism into a technical discussion (sadly happens very often here on HN). We all know by now that the bigcorps are to a large degree based on data collection. We do not need to be reminded about it each and every day. We are adults, if we don't like it we use alternatives.

edmundsauto · on Feb 9, 2021

Yeah, this is a fair point. My larger point was mostly that HN misses out on some valuable comments by insiders because those people are disincentives by some of the rhetoric and tone when an article on big tech is popular. I didn’t think the comment I replied to was particularly aggressive - it was just something that came to mind when I read it. OP was actually very kind and constructive in their response - a good ending and constructive discussion for us all!

tylerhou · on Feb 9, 2021

This is right on the money — getting search results for queries that are too personalized to e.g. location means that you can't cache those search results (or if you did cache them, their entries would be useless).

ma2rten · on Feb 9, 2021

Right, you can cache that query. That doesn't mean that you can cache "two bunnies playing in the snow r/aww reddit".

ma2rten · on Feb 9, 2021

I think you mean "15% seems too high". Any easy way to think about this is the following: even if search the entire internet you will almost never see the same sentence twice, assuming it's has a certain number of words. There is a combinatorial explosion in possible sentences to write. Search queries are essentially just sentences without stopwords.

djhn · on Feb 9, 2021

Removing stop words is what old school users of IT systems do, because that's what we learned worked best at the time.

Internet users who came online later, from GenZ to many boomers, will often just write conversational sentences and questions.

jobigoud · on Feb 9, 2021

I don't understand how you compute that estimate.

I doubt you store the history of all searches ever? People don't need a google account to query the engine, others disable history, etc.

Are you saying you still have all searches ever made ever? Because you would need this to say a query hasn't been made before wouldn't you?

freeone3000 · on Feb 9, 2021

Why would you not store every search ever? It's only a few petabytes, and you can find out all sorts of useful info from it.

simias · on Feb 9, 2021

I don't know how they did it but I suspect that it wouldn't be very hard to model the distribution by sampling a few million queries and extrapolate from that.

melq · on Feb 9, 2021

You'd only need to store the list of unique searches, but even if that's true and the 15% number is true, that must be a huge amount of data.

bagels · on Feb 9, 2021

https://blog.google/products/search/our-latest-quality-impro...

"There are trillions of searches on Google every year. In fact, 15 percent of searches we see every day are new"

zepto · on Feb 9, 2021

It would still be helpful to know what ‘new’ means.

Does it mean literally the text string typed into the box by the user is new?

Or does it mean the text string combined with a bunch of other inferred parameters we don’t know about is new?

jjk166 · on Feb 10, 2021

At the lower bound, that's 150 Billion "new" searches per year. There are approximately 50,000 unique english words not including names or misspellings. If google searches were on average for 3 words, it would take 833 years at that rate to go through all the combinations.

Alternatively, if we assume that google has already recorded 20 Trillion unique search queries (~ 1 Trillion new ones per year for 20 years), the odds that a query composed of 3 correctly spelled english words that are not names has been seen before is 1.6%. Even if we restrict queries to those using the most common 1000 words, there's a 50/50 chance of a query composed of 4 words being unique.

Of course people do not just type random words into the search bar and some terms will be searched many thousands or even millions of times, but still if anything the fact that 85% of searches aren't unique seems surprising.

FalconSensei · on Feb 10, 2021

> Does it mean literally the text string typed into the box by the user is new?

I guess that could be the case. Many could be related to things that are on the news. Like, 'the cw powerpuff girls' for the new show that was announced. No one was searching for that until the announcement, probably

zepto · on Feb 10, 2021

Right - but without clarification from Google, we really don’t know what it means.

jobigoud · on Feb 9, 2021

New for the day or new for the history of the engine?

rntksi · on Feb 9, 2021

I think GP post has a point. I've noticed people use Google really differently from how I do. E.g. I would go search for "figure concave" while my brother would search a longer phrase.

Also, speaking of people behaviour, it would not make sense to search everyday for "cute animals", but the volume of searches done for new things people discover as they get older would make more sense. I mean just look at search trends for things like "hydroxychloroquine" for example (and that's not to mention people who get it wrong, i.e. other factors for differing search queries too)

Also, other languages can change the queries depending on how you phrase the sentence too. Add to that the people using other ways to search instead of just visiting google.com and I think you can get pretty close to 10%.

If fingerprinting is the reason, 15% would be a figure too low I surmise. Would that be the case I think that would make probably 20-25% of searches rather than 15%.

It could very well be that they do classify fingerprinted search differently only in some countries and not others? That would/might explain the 15% figure.

I might be wrong and under-estimated fingerprinting techniques for Google. If they have really good fingerprinting techniques, that would reduce the estimate I have in mind to a better number (close to 15, maybe?)

zelon88 · on Feb 9, 2021

So consider your hydroxychloroquine example again this way;

Nobody has ever searched for hydroxychloroquine before today. Today is the day the word is hypothetically invented. Today 2 million people will search for hydroxychloroquine. But only one of them was the first to do it.

What I know about pop-culture and viral internet culture is telling me that 15% of 1 trillion searches being unique is shady math.

So I am not fully convinced that the 15% claim is completely transparent.

no_way · on Feb 9, 2021

It's a guess, but my thinking is that previously most people who searched term hydroxychloroquine were mainly scientists and other people related to that not your general population. Suddenly covid happens and now large numbers of people learn about this new drug they never heard before, they are gonna search, and I presume this, most wildly different things like: "how does it work?" "does it cause some disease?" "insert something political here about hydroxychloroquine" "did aliens make hydroxychloroquine?" and many more things I lack imagination to come up with and that's only about hydroxychloroquine. I doubt 15% number is about single word cases, but more about combination of words and that seems reasonable. Inventing new words daily seems unlikely, chaining them on the other hand seems plausible.

nostrademons · on Feb 9, 2021

The vast majority of people don't search for [hydroxychloroquine]. They search for [Is hydroxychloroquine effective in treating COVID-19?] or [What is the first drug that was approved to treat COVID-19?] or [What methods do we currently have to treat COVID-19?]. You can see these on the search results page as the "Common questions related to..." widget. How else do you think Google gets that data?

The folks who use keyword-based searches are largely those who got on the Internet before ~2007. Tech-savvy, relatively well-off, usually Millenial or Gen-X, plugged into trends. This happens to be the demographic dominant at Hacker News. But there's a much larger demographic who just types in whatever they're thinking of, in natural language, and expects to get answers.

Come to think of it, this is also the demographic that doesn't use tabbed browsing, and uses whichever browser ships with their OEM, and often doesn't realize that there's a separate program called a "browser" running when they click on the "Internet", and issues a Google Search for [google] (#3 query in 2010) when they want to get to Google even though they're on Google already but don't realize it, and doesn't know what a URL is. When a big-tech company makes a brain-dead usability decision you don't like, first consider how that usability choice might appear to your grandmother and it might not seem so brain-dead.

rfoo · on Feb 9, 2021

> So your claim that 15% of their queries have "never been seen before" is probably inaccurate.

I'm not sure, on my productive days maybe >50% of my Google searches are not very cachable. (for example, I just googled "htop namespace", "htop novel bytes", "htop pss", "htop nightly build ubuntu 14.04")

wolfd · on Feb 9, 2021

https://blog.google/products/search/our-latest-quality-impro...

They briefly mention the statistic in the last paragraph.

wwwwewwww · on Feb 9, 2021

It's somewhat analogous to the claim that almost every spoken sentence had never been spoken before in the history of language.

Rastonbury · on Feb 9, 2021

Well, you'd be wrong

zelon88 · on Feb 9, 2021

Meh, it happens.

anang · on Feb 9, 2021

Wikimedia also has less incentive/drive to meticulously track every interaction on their pages. The level of tracking present on Facebook and Google has to be extremely computationally intensive.

dalbasal · on Feb 9, 2021

I agree. Another (no contradiction) way of looking at this is that Wikimedia infrastructure is radically different because Wikimedia is radically different.

They need it to be a certain way in order to operate. The limitations and advantages of how software gets made. Why it gets made. The way the software works. How and why product decisions were made over the last 2 decades. What resources they have/had available. It's all a totally different game. Not surprising that different soil and a different climate grow different plants.

One of Google's early coup d'etats, when they were a strategic step ahead of the boomers, was bankrolling gmail, youtube and such. Gmail offered free giant inboxes. They got all the customers. This cost billions (maybe 100s of millions), but storage costs go down every year while the value of ads/data/lock-in and such go up every year. Similar logic for youtube. (1) Buy a leading video-sharing site; (2)bankroll HD streaming because you have the deepest pockets (3) Own online free TV entirely.

That's who Google is, good or bad. How funding works. What products get built. What infrastructure is necessary, possible, affordable. All interlinked. Wikipedia & Google were founded at the same time. Within 5 years (circa 2006) Google was buying charters and fiefdoms. Wikimedia, meanwhile, was starting to take flak for raising 3 or 4 million in donations.

It's kinda crazy that Wikipedia is comparable in scale to FAANGs when you consider these disparities.

Retric · on Feb 9, 2021

You can do quite a bit of processing per page load without issue. Facebook and Google just take it rather past that point into near absurdity, while still being highly profitable.

throwaway3699 · on Feb 9, 2021

To be fair, there's a bit of a combinatoric effect of scale * features going on there. I'm sure you could build most of a Facebook equiv. 100x-1000x cheaper if it only served one city instead of the whole planet.

klodolph · on Feb 9, 2021

The effects of scale are less combinatoric than you might think. Most people on my Facebook feed are from the same city anyway, even though Facebook is global.

erichurkman · on Feb 9, 2021

The effects and scale of sales (ads) are very combinatoric, though.

ianlevesque · on Feb 9, 2021

Yeah why do they keep spending billions to build new datacenters when they could just stop being absurd instead?

The contempt on here is crazy sometimes.

bo1024 · on Feb 9, 2021

The idea of marginal value/marginal cost is that companies will generally continue spending one billion dollars to add size and complexity, as long as they get back a bit more than a billion dollars in revenue.

So it wouldn't necessarily be contradictory if most of their core functionality could be replicated very simply, yet the actual product is immensely complicated. I forget where I first read this point, but probably on HN.

civilized · on Feb 9, 2021

Or maybe you're just reading too much into "absurd" which can just be a colorful word for "an extremely huge amount"

MereInterest · on Feb 9, 2021

I don't think that Facebook/Google developers are foolish or incompetent. That would be contempt. Instead, I think that Facebook and Google as conglomerate entities are fundamentally opposed to my right to privacy. That they make decisions to rationally follow their self-interest does not excuse the absurd lengths to which they go to stalk the general population's activities.

ryanianian · on Feb 9, 2021

> I don't think that Facebook/Google developers are foolish or incompetent.

Nobody in this thread is saying that. Parent to you said:

> they could just stop being absurd instead [of building more DCs]

implying FB could build fewer DCs by scaling down some of their per-page complexity/"absurdity". Basically saying their needs are artificial or borne of requirements that aren't.

> conglomerate entities are fundamentally opposed to my right to privacy

That's a common view, but it's not on topic to this thread. This thread is mostly about the tech itself and how WikiMedia scales versus how the bigger techs scale. It has an interesting diversion into some of the reasons why their scaling needs are different.

You could instead continue the thread stating that they could save a lot of money and complexity while also tearing down some of their reputation for being slow and privacy-hostile by removing some of the very features these DCs support (perhaps) without ruining the net bottom line.

This continues the thread and allows the conversation to continue to what the ROI actually is on the sort of complexity that benefits the company but not the user.

Retric · on Feb 9, 2021

I was the one saying absurdity and I think you’re missing the context. Work out how much processing power is worth even just another 1 cent per thousand page loads and perfectly rational behavior starts to look crazy to the little guys.

Let’s suppose the Facebook cluster spends the equivalent of 1 full second of 1 full CPU core per request. That’s a lot of processing power and for most small scale architectures likely adding wildly unacceptable latency per page load. Further, as small scale traffic is very spiky even low traffic sites would be expensive to host making it a ludicrous amount of processing power.

However, Google has enough traffic to smooth things out, it’s splitting that across multiple of computers and much of it is after the request so latency isn’t an issue, and it isn’t paying retail so processing power is little more than just hardware costs and electricity. Estimate the rough order of magnitude their paying for 1 second of 1 core per request and it’s cheap enough to be a rounding error.

cmckn · on Feb 9, 2021

Every request at FB is handled in a new container. This isn’t absurd, it’s actually pretty neat :)

Edit: I don’t know what I’m talking about. Happy Monday!

rachelbythebay · on Feb 9, 2021

What? Are you calling the context of a HHVM request a container just to confuse people?

Also, there's way more than just the web tier out there.

cmckn · on Feb 9, 2021

Wasn’t my intention to confuse, just repeating something I’ve been told by FB folks.

Everyone, please listen to Rachel and never ever me.

ROARosen · on Feb 9, 2021

Wow that sounds interesting, does anyone know if this is true?

wilsonthewhale · on Feb 9, 2021

I'm not on the team that handles this, but I highly doubt that this is the case.

robmurrer · on Feb 9, 2021

is not neat... is freakish

MaxBarraclough · on Feb 9, 2021

Does it hurt their caching if you browse Wikipedia when signed in?

I recall reading HackerNews used to have that problem, unsure if it still does.

tim333 · on Feb 9, 2021

Looking at the source of a Wikipedia page it has my username appearing 6 times so I guess it must reduce caching a bit. Though I guess they could cache the user info bits and the rest of the page and just splice them together.

dalbasal · on Feb 9, 2021

In interviews with Jimmy Wales, he seems somewhat regretful of not having made Wikipedia a for-profit. At the least, he's fairly adamant that Wikipedia could have been Wikipedia as a for profit.

The way he structured wikipedia, from back-end infrastructure to ownership/governance structure was just the logical way of doing the project. Times were different. Online culture was different.

I don't want to overinterpret the man, or put words in his mouth... but... I got the impression that Wales thinks that if he was starting Wikipedia now, he'd just do it asd a startup and also succeed.

To me, this is almost sad. Besides being an awesome encyclopedia, wikipedia is existence proof for something of scale outside the norm. Something that isn't a corporation. A lot of things are deterministic to the structure of an organization.

For example, take the current postpostmodern war over truth and stuff: platforming/deplatforming, freedom of speech, censorship, bias, manipulation, narrative = power issues, etc. Wikipedia is at the very centre all these problems. Whatever difficulties Twitter is experiencing should be 100X worse for wikipedia. Meanwhile, Wikipedia is withstanding far better, and with far more integrity. I don't think this is a coincidence.

Dunking on wikipedia's budget/spending is popular. Meanwhile, Wikipedia uses <1% of the resources/budget of Twitter. They are operating @ >100X efficiency compared to a realistic for-profit equivalent. That's a flying shuttle.

We know that Wikipedia, Linux & The Worldwide Web are possible because they exist. We literally wouldn't know otherwise. Theory couldn't have gotten us to this knowledge. Each is existence proof for other ways of doing things. They aren't necessarily roadmaps, but I'm a big believer in existence proofs. What Jimmy made is 100X better, more important and non-inevtiable than what Zuck made. The thought that he wants to be Zuck bums me out.

nolok · on Feb 9, 2021

It would succeed the same way Quora does. Much less open, much less universal, much more user hostile, with an almost agressive way to deal with unlogged user.

In terms of financial and organisational success it would probably largely beat what it is now. It terms of benefit to humanity, it would be much worse.

Company + for profit + laws means access to information has to be much more tailored to the laws of each place. "Let's remove tianamen's article or lose your chinese license" kind of things.

I'm for one am glad for the current wikipedia we have, despite it's numerous flaws. I still donate every year, although I wish Wales could stop having it spend its money the same a startup or FAANG does.

dalbasal · on Feb 9, 2021

That's one option, though I wouldn't necessarily use Quora as a mainline example. They're kind of a $gme for rich people. I think highly enough of Jimmy to bet on him doing a much better job than that.

Stackoverflow is a decent example. Very capable founding team. They explicitly tried to be like a commercial wikimedia. They do embrace quite a lot of openness, notably creative commons... learning from wikimedia successes.

RE "I wish Wales would:" Another consequence for how wikipedia is structured is that Wales isn't the Zuckerberg of Wikimedia. Power is a lot more dispersed.

RE spending/flaws and such: I feel like wikimedia is held to an extremely unfair standard. Who/what should we compare them to?

Wikimedia spend $70m per year. This is probably less than Quora or stackexchange. FB & Twitter (IMO more comparable in terms of scale/importance) spend $55bn & $3bn. Twitter spends 45X more than Wikimedia. Facebook spends almost 1,000X compared to Wikimedia. The bang-for-buck is insane.

Also in terms of flaws in rules/judgement calls. A lot of people are highly critical of wikipedia's "deletionism" related MOs. What articles/edits stay in. How good the rules & procedures are for this. What "camp" has power, and how they treat the other camp. I get that this stuff is contentious.

Meanwhile on Twitter or Facebook, the rule is "I decide." "But it gets us clicks" is the killer argument. Nothing is transparent. Wikimedia is doing a much better job, respecting user & editor rights far more, being a lot less self righteous. Of course it's not perfect, but come on. The "norm" is Facebook's content policy, Twitter's safety department, or Apple's App store approval room. Wikimedia is the one example of being better than that... and for that everyone is always yelling at them.

HeadsUpHigh · on Feb 9, 2021

Quora imo is a horrible website and I rarely find actually good advice on it. At this point I actively avoid clicking on it's links because of how aggressive they are towards non logged users.

BlueTemplar · on Feb 9, 2021

Yeah, the Web was quite impressive (though we already had the Minitel), but it was Wikipedia that really blew my mind (even though we already had Encarta). (In fact I consider Wikipedia to be the Web's "killer app", even more than Google and other search engines were.)

dalbasal · on Feb 9, 2021

Out of all the "killer apps" for the web... wikipedia is the one that implement the www most faithfully. Hypertext articles. Most apps got the web to do x. Wikipedia is what it was made to do.

BlueTemplar · on Feb 9, 2021

Yeah. I was about to add that it pretty much has been Tim Berners-Lees vision coming to fruition, but the fact that Wikipedia is centralized has stopped me. But then isn't the Web itself technically 'centralized' on the Internet ? And isn't Wikipedia a great example of pseudonymous strangers (= social decentralization) collaborating with each other ?

dalbasal · on Feb 10, 2021

Is it really that centralized? Citations and footnote links are a pretty important part of what wikipedia is. I mean, I very rarely click through to see source material, but when I do it's noticeable how much more powerful wikipedia is than a standard encyclopedia.

Vinnl · on Feb 9, 2021

I can also imagine that he'd say that just because it makes him look/feel better, i.e. it's more of a sacrifice if he gave it for free while he also could've been a billionaire, than if this was the only way Wikipedia could ever have been a success.

Then again, WikiTribune was a for-profit.

np_tedious · on Feb 9, 2021

I have, and this is is still fascinating. Got any more links you'd suggest?

tassu · on Feb 9, 2021

https://media.ccc.de/v/36c3-73-infrastructure-of-wikipedia

tomglynch · on Feb 9, 2021

Just added this comment on the issue:

Hi all, I've been doing a bit of research into possible apps that could be causing this and found two potential culprits that I am currently investigating.

The first is Mitron TV, an Indian TikTok alternative which was made available again on the app store June 6th (https://indianexpress.com/article/technology/tech-news-techn...).

The second is Say Namaste, an Indian Zoom alternative which was launched on the app stores June 9th (https://indianexpress.com/article/technology/tech-news-techn...).

Both fall into the timeline of huge increases, have millions of users and may be using '1280px-AsterNovi-belgii-flower-1mb.jpg' to check the users internet connection - especially for Say Namaste to ensure video connectivity. I've reached out to some developers at both companies and will report back. Let me know your thoughts.

EDIT: I have also noticed the dates match the reopening after lockdown for the whole of India: "This first phase of reopening was termed as "Unlock 1.0"[13] and permitted shopping malls, religious places, hotels and restaurants to reopen from *8 June*." (https://en.wikipedia.org/wiki/COVID-19_lockdown_in_India#Unl... )

Tom

batch12 · on Feb 9, 2021

Based on this, I just reversed both Android apps and am not seeing strings related to wikimedia nor asternovi. This doesn't mean it's not obfuscated somehow though. The only app I've found the strings in so far is the "ravn" app proposed by @taviso. As mentioned in the twitter thread though it doesn't seem to have the install base to cause this traffic--

tomglynch · on Feb 9, 2021

Thanks batch12. In my edit, it could also be related to a check-in app used at public spaces in India - as it increases from the 8th of June which matches when the India-wide lockdown began to lift. Perhaps a reverse of qr code scan checkin apps in India could be useful?

batch12 · on Feb 9, 2021

Could be-- I checked about 50 apps from alternative lists that popped up after the ban with no luck except for that one I mentioned before.

Looks like they posted shortly after yours on the ticket that they found the culprit. Guess we'll find out tomorrow if we were on the right path.

tomglynch · on Feb 9, 2021

Yeah hopefully they have a bit of a write up too about how they worked it out - interesting problem to solve!

catlover99 · on Feb 9, 2021

I took a look at the apk and noticed this in the manifest. "com.blockeq.stellarwallet.WalletApplication" Stellar Lumens is a fairly popular crypto currency. I wonder if the app has built in support for crypto transactions. If not, maybe it's malware to mine crypto coins.

https://i.imgur.com/o8DllVd.png

captn3m0 · on Feb 9, 2021

It is a crypto chat application:

>Ravn is your portal to the most private messenger as well as Korrax our proprietary token. Stay up to date with Korrax and other Cryptos and join the crypto group chats.

>Messages, images and docs are never stored on a server (after delivery), they’re only locally stored on your own phone. Ravn is not tied to your phone number or email, you only sign up with a username that isn’t searchable or discoverable.

WeekSpeller · on Feb 9, 2021

Stupid question: how did you reverse the app in Android Studio?

catlover99 · on Feb 9, 2021

I downloaded the APK and then used "Profile or Debug APK" under file in Android Studio and ctrl/cmd+shift+f to search for strings.

I don't know much about Android development or APKs but it's not exactly "reversing." from what I understand the profile/debug converts the .dex files from the APK to .smali which is human readable.

NicolaiS · on Feb 9, 2021

You can use the "Analyse APK" feature, but you probably rather want to use tools like jadx or apktool that provides fairly good decompilation.

thetanil · on Feb 9, 2021

As far as I know, this is also an image commonly used in machine learning tutorials for image classification of species of flowers. I don't know if the tutorials use the mediawiki source directly though. I do recognize this image though. I think it's in the SciKit Learn O'Reilly book.

bombcar · on Feb 9, 2021

I had some random images on a web server years ago - and noticed that something like 99% of my traffic was one image - and searching through refers I realized I was the #1 hit on google images for robot attack cat.

Simpler times.

berkes · on Feb 9, 2021

I had a similar issue. Some 15+ years ago, an image from my blog showed up for people who searched the phrase 'Peanubutter Sex'. The image had nothing to do with peanutbutter nor with sex. My blog is SFW. It was some screenshot of KDE IIRC.

For almost a week it remained the most requested image the post on which it appeared, the most popular.

It did make me uncomfortable, though. Fearing that my rankings would plummet or so.

My takeaway is nothing new: there are weirdo's online.

kristianp · on Feb 9, 2021

So where did the connection between "Peanubutter Sex" and your blog come from? Did you ever find out?

berkes · on Feb 10, 2021

I did not.

My blog did have the word "Peanutbutter" on one or two posts. And the word "sex" on another. Maybe at some point both words showed up close to one another when experimenting with some "random articles" sidebar or some "you may also like" list.

Tijdreiziger · on Feb 9, 2021

Can we see the image? :D

bombcar · on Feb 9, 2021

https://imgur.com/8MMET5V - now there are companies that can host things FOR my servants!

BlueTemplar · on Feb 9, 2021

Anyone knows how imgur is able to afford that ?

kristianp · on Feb 9, 2021

Low cost CDNs like Cloudflare.

matteocontrini · on Feb 10, 2021

Imgur uses Fastly, which I don't think can be classified as low cost.

kingnothing · on Feb 9, 2021

BlueTemplar · on Feb 9, 2021

Even though ad-blockers are ever more common ? (I didn't even think about ads, since I never saw them on imgur !)

robotnikman · on Feb 10, 2021

They have an app which is ad supported, and they are trying to become of a social network like reddit by requiring you to login for more features, and adding upvotes and other interactivity.

kingnothing · on Feb 9, 2021

It's the same way Google pulled in $180B last year and Facebook made $86B.

BlueTemplar · on Feb 10, 2021

AFAIK both of these are using methods that are either deemed acceptable by adblockers (like how Google was using plain blue/black text instead of stroboscopic gif banners), and/or adblockers have trouble with them because they come from the same source ?

ConcreteGidget · on Feb 9, 2021

Yes, but only 10 million times.

guessbest · on Feb 9, 2021

You got to have hotlink protection on when you are hosting memes. I've learned this the hard way, too.

globular-toast · on Feb 9, 2021

Remember when some sites would send you a shock image instead of the one you were expecting if it detected hot linking? I don't miss that.

mcintyre1994 · on Feb 9, 2021

There's a site that's occasionally posted in comments here which is apparently run by someone who hates HN because they serve some image I can't remember to readers from here.

alickz · on Feb 9, 2021

Yeah it was posted not long ago. When it sees HN as the referrer it shows a picture of a testicle in an egg cup.

m463 · on Feb 9, 2021

The "traditional" way of fixing this would be a goatse.cx redirect of the image.

I'm sure there is a more enlightened fix.

sangnoir · on Feb 9, 2021

...or sending that image[1] jwz sends back upon detecting HN in the referer. I bet they'll find the app in a matter of hours, or at least reduce the traffic drastically.

1. https://www.jwz.org NSFW!

lxe · on Feb 9, 2021

Just learned that this person owns DNA Lounge (and pizza?), and is a founder (early contributor?) of Netscape and Mozilla.org. I've lived and worked in that particular area of SF for years and haven't known this.

alsetmusic · on Feb 9, 2021

One of my company's clients has a beautiful office right above DNA Lounge (well, across the street or just adjacent - it's been a while and I've only been there once). They told me they can hear sound checks from their rooftop patio.

hedora · on Feb 9, 2021

Also, jwz is responsible for xscreensaver.

m463 · on Feb 9, 2021

netscape used to display a spinning compass when you put about:jwz in the title bar

other good ones were about:1994 and about:mozilla

hey, about:mozilla still works in firefox

eythian · on Feb 9, 2021

about:robots also works in Firefox, I know it's been there for a long time but I have no idea if it was ever in Netscape.

kbrosnan · on Feb 9, 2021

about:robots is from the early Firefox releases. Pretty sure it is from Firefox 3.0 development as you can find the same robot in images when searching for Firefox Gran Paridiso Robot.

https://www.google.com/search?q=firefox+gran+paradiso+robot&...

tingletech · on Feb 9, 2021

there used to be linux based public terminals in DNA lounge too, IIRC

bzb6 · on Feb 9, 2021

This makes me wonder why the hell referer headers are still sent by major browsers, especially to third parties. I can’t think of a single reason that benefits the user.

qwertay · on Feb 9, 2021

Originally it probably just sounded like a cool feature to see what blog linked to you. Now its been around for so long that so much has been programmed to actually use it. If you turn it off you get every anti bot script blowing up on you.

I think browsers did drop the path from it at least.

jessaustin · on Feb 9, 2021

For one thing, examining referer is a common way that a server determines a request is not a hotlink. Sure you can do something more complicated with cookies or whatever, but lots of sites are just using referer and they'll break if the client doesn't send it.

namibj · on Feb 9, 2021

But for that it's enough to send it for same-origin requests. No need to send it cross-origin, except for tracking purposes.

iggldiggl · on Feb 9, 2021

That'd still break the distinction between hotlinking and the user using a bookmark or copy/paste to directly open the URL in question.

h_anna_h · on Feb 9, 2021

Letting the sites distinguish between the two does not seem to be in the interest of the user.

iggldiggl · on Feb 10, 2021

Well, it'd mean that any site blocking hotlinking would also automatically block direct bookmarks/URL entry, too, which isn't really in the "interest of the user" either, I'd say.

brianush1 · on Feb 10, 2021

If Chrome suddenly stopped sending referrer headers, let's be real here, 99% of websites would be fixed within a couple of days at most.

malaya_zemlya · on Feb 9, 2021

if you are making any sort of content or running a website, it is really useful to know how people found you.

sli · on Feb 9, 2021

All I get is a scrolling hex editor looking thing. Maybe that redirect has been disabled?

mey · on Feb 9, 2021

You aren't sending a referrer header (a good thing).

jaredsohn · on Feb 9, 2021

Try from a new profile or incognito.

I saw the described image but after I visited the site directly I couldn't see it any more when redirectly via hacker news. Saw it again when I opened an incognito tab.

pengaru · on Feb 9, 2021

Yep, jwz has had a change of heart and sees today's HN as a born again breath of fresh air.

geoelectric · on Feb 9, 2021

I’m seeing the nut sundae on iOS mobile so I wouldn’t get too happy yet...

murph-almighty · on Feb 9, 2021

For those reticent to click on their work computers but morbidly curious, can someone describe the image?

sangnoir · on Feb 9, 2021

It's a motivational-poster-type image with a white egg holder in the foreground, but instead of an egg, it's holding one exquisitely detailed hairy, caucasian ball[1]. At the top, the title is "HACKER NEWS" and the bottom text is "A DDoS OF FINANCE-OBSESSED MAN-CHILDREN AND BROGRAMMERS"

1. Is there a collective biological term for scrotum and it's contents that is not general like "genitals" is?

Dylan16807 · on Feb 9, 2021

I think he's the only one that uses that? Barely even worth mentioning in comparison.

mey · on Feb 9, 2021

A permanent redirect to a non-image page (owned by Wikimedia) may achieve the same thing. Either the calling system can't support a HTML response, or it's a webview in which case you could either report an error or provide a notice. Maybe even ask for donations :)

ed25519FUUU · on Feb 9, 2021

Or just downsample the image to a reasonable size and deal with it. Nothing inherently wrong with having a popular image.

ehwhyreally · on Feb 9, 2021

Yes there is when you are hotlinking. Hotlinking in general is considered theft, you are using someone elses bandwidth and could even ddos the host if you are not caching the response.

concordDance · on Feb 9, 2021

> Hotlinking in general is considered theft

This is a pretty puzzling idea to me. How could linking something be theft?

To explore this, I shall try a metaphor. Imagine you're on a big social media website (lets call it Programmer Olds) which has an oddity in that 99% of its users use adblock. You then post a link to another small (ad supported) website on your Programmer Olds page, causing a large number of people to click through and download the page using large amounts of bandwidth (for no monetary gain to the site) and possible DDOSing the site.

Have you commited theft?

lorenzhs · on Feb 9, 2021

> This is a pretty puzzling idea to me.

That's because you're responding to an entirely different issue. "Hotlinking" isn't linking to something, it's including a resource that is hosted elsewhere. It's putting <img src="https://concordDance.whatever/images/big_image.jpg"> on my website without asking you. Now if my site ends up on the front page of HN, that could cause a lot of traffic to your site, potentially overwhelming your server or increasing your hosting bill. It's not nice, and rightfully frowned upon.

concordDance · on Feb 9, 2021

But from a loss and gain perspective it seems equivalent.

In both cases the site loses bandwidth for no gain due to your actions.

JCharante · on Feb 9, 2021

> causing a large number of people to click through and download the page using large amounts of bandwidth (for no monetary gain to the site)

The difference here is that while a lot of users use adblock, there are some that don't. These users can still see the ads. Additionally even though it's a small website, it may lead to new readers that stick around or the content itself may even be sponsored.

The equivilent to hot linking a picture would be like taking the content of a blog post without really linking to the source, because there's no chance of conversions there. If you're linking to the site itself then there's a reasonable chance that users can convert.

So I suggest that it's theft just because the chances of readers being converted is nil while you're using their bandwidth.

jjk166 · on Feb 10, 2021

Let's say I own a restaurant. Someone comes in and wants a panini. I don't have a panini press, but the restaurant next door does.

If I tell the customer they can go next door to get a panini, I'm not stealing anything. Maybe that restaurant is packed right now and they'ed rather not have an extra customer, but there is a reasonable expectation that they would generally want customers or at least have a means of turning away unwanted customers otherwise.

On the other hand if I break into my neighbor's restaurant, make a panini, then bring it back to my restaurant to serve and make money off of, all without permission from the neighbor, I am most definitely stealing. Even if I doubt the neighbor will mind because he let me come over and make myself a panini once, I can't unilaterally act off that assumption.

SpaceRaccoon · on Feb 11, 2021

Is adblock a form of theft?

yreg · on Feb 9, 2021

No, it's not universally considered a theft. Wikimedia explicitly permits hotlinking[0]. So does xkcd, imgur and tons of other sites.

Of course when someone doesn't want us to hotlink to their assets then don't do it.

[0] https://commons.wikimedia.org/wiki/Commons:Reusing_content_o...

arbitrage · on Feb 9, 2021

it's so easy to mitigate, though, that the fact that one doesn't sorta implies that one might want randos from the internet to use one's resources to view this image.

it's not theft if you leave it out for everyone to use.

JackWins · on Feb 9, 2021

My garden doesn't have a fence, doesn't mean you can host your picnic here.

dylan604 · on Feb 9, 2021

No, but if I wander into your garden and "injure" myself, I can sue you for damages. You will be held negligent for not properly protecting yourself from preventing other people from injuring themself on your property.

zaarn · on Feb 9, 2021

Wikimedia has a User-Agent policy which is being violated here. Hence this is the property owner putting up a sign that says "risk of injury", so if you walk in and injure yourself, you only have to blame yourself for being negligent.

h_anna_h · on Feb 9, 2021

The policy is for how wikipedia will act when encountering clients with certain user-agent headers, not a rule for the clients.

zaarn · on Feb 9, 2021

It's a policy how wikimedia acts when clients lack a user agent header, it's therefore effectively a rule for clients as without a proper UA header, they may be blocked indefinitely.

gokhan · on Feb 9, 2021

Is this something real (in US, most probably)?