bonaldi's comments

This team really have been thinking about weather a lot, and it makes me very curious about what they’ve created this time.

It’s that depth of thought and expertise that feels missing from most of the vibe-coded launches we’ve seen recently. I actually wouldn’t mind if Acme had vibe coded parts, but I bet they didn’t.


> it makes me very curious about what they’ve created this time

The rainbow and sunset alerts are really cool ideas. I'm now realising that a simple tie-in to astronomical phenomena could prompt a useful notification around it, e.g. that it's worth going stargazing that night. I ski; learning that the near-term forecasts just changed would help me change my schedule the day before versus trying and failing the morning of.


I'm almost shocked we don't have a large weather model instead of a language model. Seems right up their alley.

Also I don't get what happened, but I think it was AccuWeather or Weather Underground in the early 2000s where forecasts were to-the-minute accurate, and it seems like it's gotten worse everywhere since.


Google, Microsoft, Huawei, NVIDIA all have AI weather forecast models:

https://deepmind.google/science/weathernext/

https://microsoft.github.io/aurora/intro.html

https://www.huawei.com/en/news/2023/8/pangu-weather-forcast

https://blogs.nvidia.com/blog/nvidia-earth-2-open-models/

A Swiss startup named Jua does this for energy markets. Disclosure: I used to work there.


I think Google's weather models could be called LWMs. They're doing interesting research in this space.

Weather Underground had a very reasonable feed that you could subscribe to back in 2000. I used it back then for a cluster of farming websites I built.

> I'm almost shocked we don't have a large weather model instead of a language model. Seems right up their alley.

We do have such models. A bunch of them actually:

- Google DeepMind's WeatherNext 2

- Microsoft's Aurora

- NVIDIA's FourCastNet-3, Atlas, and Climate-in-a-Bottle

- ECMWF's AIFS

- ...

The list goes on. Plenty of small startups have repeated the recipe for building these types of models with their own architectural twist, too.


He was arrested for refusing to allow officers to enter his home on a pre-agreed return visit to discuss the complaints:

https://www.whatdotheyknow.com/request/arrest_of_mr_darren_b...

This is why the Daily Mail causes rolled eyes (along with Spiked and the rest of the right-wing agitprop).


Re-read what you just linked. In the response from the JIMU:

"A 51-year-old man from Aldershot was arrested on suspicion of sending by public communication network an offensive, indecent, obscene, menacing message or matter."

This is the legal basis for the arrest. Without the retweet, police would not have had authority to turn up to his place of residence - twice - and demand entry. No doubt they preferred Brady voluntarily submit himself for interview at the station, but he refused, which I hope we can all agree is the morally correct position. No one should have police turn up outside their house - TWICE - because of a parody retweet.


Why on earth was he legally obligated to have that discussion in the first place?

Those complaints should have been laughed at and ignored.


The law might be a bad one (and probably is) but on balance better that police investigate suspected illegality than don’t. Overall I’d rather be somewhere where even a former royal can be arrested than somewhere the rule of law is optional.

This doesn't feel like good faith. There are leagues of difference between "what you typed out" when that's a highly structured, compiler-specific, codified syntax *expressly designed* as the input to a compiler that produces computer programs, and "what you typed out" when that's an English-language prompt, sometimes vague and extremely high-level.

That difference - and the assumed delta in difficulty, training and therefore cost involved - is why the latter case is newsworthy.


> This doesn't feel like good faith.

When has a semantic "argument" ever felt like good faith? All it can ever be is someone choosing what a term means to them and trying to beat down others until they adopt the same meaning. Which will never happen, because nobody really cares.

They are hilarious, but pointless. You know that going into it.


Fastmail is the way. These are people for whom email is their job and focus and you get everything that comes with that, including good and responsive customer service.


But their servers are in the US.


So are the email servers used by the recipients of your emails, no? Almost everybody uses Gmail, so even if you don't, most of your email correspondence is going to end up on, or originate from, Gmail servers anyway.


Personally I don't know the last time I wrote to a Gmail address, so depending on location and environment, avoiding US mail servers may be possible.


GDPR applies if you're in the EU regardless, but it would be nice to have it split like bitwarden[.eu].


Because we exist within a market, where the choices of others end up affecting us - if the market "votes" for a competing thing, that might affect the market for the things you care about.

Your car analogy isn't great, but we see a similar dynamic playing out with EV vs combustion, and we did with film-vs-digital cameras. "Don't buy a digital camera if you like film" sure didn't help the film photographers.


This is like "HTML isn't code" again. For non-technical readers, there is their own language, and there is "code": a bespoke language used solely to instruct machines. If you can't type to the machine in your own language (e.g. like you can to a chatbot), then you're using code. "The machine" is the device on the desk.

"ls" is code. You type it into the machine's keyboard, and it understands your code and performs that instruction. The statement is not "radically" wrong, it's an oversimplification that both communicates correctly to the lay reader, and to the proficient reader who understands the nuances and why they're irrelevant here.


> Tesonet initially assisted Proton with HR, payroll, and local regulation

Entirely normal behaviour for a competitor to provide “HR assistance”.


I've been part of a European startup that added offices in Asia and the US, and we initially always partnered with local companies to do this. It's mutually beneficial. It allowed us to grow more quickly, and it allowed them to make relatively easy money (and, in our case, to dump some of their shittier employees on us without us knowing).

In Proton's case, they already knew each other because Tesonet had previously offered to provide infrastructure during a DDoS attack against Proton.

So maybe it's a conspiracy, or maybe it's just how things go. You can make up your own mind, but you should provide the facts when you make sinister insinuations.


You know an awful lot of detail about the inner workings of two separate private companies though.


Is it really that shocking that someone on HN would have worked at as many as 2 private companies?


Nor is it shocking that a company with a PR issue would be astroturfing our forum.

The point is: we don't know.


I would assume that if they were astroturfing, they would be smart enough to use more than one account. Given that, I'm inclined to believe that you are part of an astroturfing campaign.


The summary is: if you use someone’s VPN, Tor, etc. you’re just setting yourself up. There is no privacy, and if you act like you want privacy, they’re going to pay more attention to you.


That's what they want you to think.


LOL, now I'm part of the conspiracy. This is all public knowledge.


Then could you provide sources, please?


Here you go: https://www.reddit.com/r/ProtonVPN/comments/8ww4h2/protonvpn...

Here's the Handelsregisterauszug for Proton, which shows ownership: https://www.zefix.admin.ch/en/search/entity/list/firm/118926...

Proton's peering relationships: https://bgp.tools/as/62371#asinfo

I'm not sure what exactly you're looking for.


> Here's the Handelsregisterauszug for Proton, which shows ownership

It doesn't. It's a joint-stock corporation, and while the shareholders are registered, the register is not public.


Proton discloses shareholder information here: https://proton.me/support/who-owns-protonmail

But I guess they could be lying.


Them providing information isn't the same as it being publicly verifiable.


> Mails are superior in announcing to multiple people

People who are known at time of sending. A slack message can be searched by those joining the team much (much) later, those who move teams, in-house search bots, etc. Mailing lists bridge this gap to some extent, but then you're really not just using email, you're using some kind of external collaboration service. Which undermines the point of "just email".


> > Mails are superior in announcing to multiple people

> People who are known at time of sending. A slack message can be searched by those joining the team much (much) later, those who move teams, in-house search bots, etc.

People use slack search successfully? Its search has to be one of the worst search implementations I have come across. Unless you know the exact wording in the slack message, it is almost always easier to scroll back and find the relevant conversation just from memory. And that says something, because the slack engineers in their infinite wisdom (incompetence) decided that messages don't get stored on the client, but get reloaded from the server (wt*!!), so scrolling back to a conversation that happened some days ago becomes an exercise of repeated scroll-and-wait. Slack is good for instant messaging type conversations (and even for those it quickly becomes annoying because their threads are so crappy), not much else. I wish we would use something else.


How would you search from mail threads you weren't CC'd on?


MS Exchange had sort-of solved that problem with Public Folders. Basically shared email folders across an organization.

The older solution is NNTP/Usenet. I wish we had a modern system like that.


> Mailing lists bridge this gap to some extent, but then you're really not just using email, you're using some kind of external collaboration service. Which undermines the point of "just email".

Mailing lists are just email. They simply add a group archiving system.


That's why online private archives like https://mailarchive.ietf.org/arch/browse/ exist. For a free version, use groups.google.com.


you just use a shared inbox for the team


This is being blocked by my corp on the grounds of "newly seen domains". What a world.


Not sure the emotive language is warranted. Message appears to be “if you use robots.txt AND archive sites honor it AND you are dumb enough to delete your data without a backup THEN you won’t have a way to recover and you’ll be sorry”.

It also presumes that dealing with automated traffic is a solved problem, which, with the volume of LLM scraping going on, is simply not true for smaller hobbyist setups.


I just plain don't understand what they mean by "suicide note" in this case, and it doesn't seem to be explained in the text.

A better analogy would be "Robots.txt is a note saying your backdoor might be unlocked".


The meaning is reasonably clear to me: Robots.txt says "Don't archive this data. When the website dies, all the information dies with it." It's a kind of death pact.


That's not a suicide note, though, in any way I understand it.


It's the inevitable suicide of the data.

Language gets weird when you anthropomorphize abstract things like "data", but I thought it was clever enough. YMMV.


The suicide of the data listed in robots.txt? How? The whole point of the article is they ignore what you have written in your robots.txt, so they'll archive it regardless of what you say.


Correct, they are challenging your written wish for data-suicide.


I also cannot figure out from context what part of this is "suicide".

I don't even think it's a note saying your back door is unlocked? As myself and others shared in a sibling comment thread, we have worked at places that implemented robots.txt in order to prevent bots from getting into nearly-infinite tarpits of links that lead to nearly-identical pages.
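As a concrete illustration of that use case, a robots.txt written to keep crawlers out of a tarpit (rather than to hide content) might look like this; the paths here are hypothetical:

```text
# robots.txt: steer crawlers away from near-infinite generated link spaces,
# not away from the content itself
User-agent: *
Disallow: /calendar/   # every "next month" link generates a fresh page forever
Disallow: /diff/       # pairwise revision diffs explode combinatorially
Allow: /
```

Per the robots exclusion standard, the longest matching rule wins, so the content under / stays crawlable while the tarpits are fenced off.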


> volumes of LLM scraping

FWIW I have not seen a reputable report on the % of web scraping in the past 3 years.

(Wikipedia being a notable exception...but I would guess Wikipedia to see a far larger increase than anything else.)


It's hard because of attribution, but it absolutely is happening at very high volume. I actually got an alert this morning when I woke up from our monitoring tools that some external sites were being scraped. Happens multiple times a day.

A lot of it is coming through compromised residential endpoint botnets.


Even without attribution…seeing bot traffic or general traffic increase


Wikipedia says their traffic increased roughly 50% [1] from AI bots, which is a lot, sure, but nowhere near the amount where you'd have to rearchitect your site or something. And this checks out: if it was actually debilitating, you'd notice Wikipedia's performance degrade. It hasn't. You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

More importantly, Wikipedia almost certainly represents the ceiling of traffic increase. But luckily, we don't have to work with such coarse estimation, because according to Cloudflare, the total increase from combined search and AI bots in the last year (May 2024 - May 2025), has just been... 18% [2].

The way you hear people talk about it though, you'd think that servers are now receiving DDoS levels of traffic or something. For the life of me I have not been able to find a single verifiable case of this. Which if you think about it makes sense... It's hard to generate that sort of traffic; that's one of the reasons people pay for botnets. You don't bring a site to its knees merely by accidentally "not making your scraper efficient". So the only other possible explanation would be a large number of scrapers simultaneously but independently hitting sites. But this also doesn't check out. There aren't thousands of different AI scrapers out there that in aggregate are resulting in huge traffic spikes [2]. Again, the total combined increase is 18%.

The more you look into this accepted idea that we are in some sort of AI scraping traffic apocalypse, the less anything makes sense. You then look at this Anubis "AI scraping mitigator" and... I dunno. The author contends that one of its tricks is that it not only uses JavaScript, but "modern JavaScript like ES6 modules," and that this is one of the ways it detects/prevents AI scrapers [3]. No one is rolling their own JS engine for a scraper such that they are blocked by their inability to keep up with the latest ECMAScript spec. You just use an existing JS engine, all of which support all these features. It would actually be a challenge to find an old JS engine these days.

The entire thing seems to be built on the misconception that the "common" way to build a scraper is doing something curl-esque. This idea is entirely based on the Google scraper, which itself doesn't even work that way anymore, and only ever did because it was written in the 90s. Everyone who rolls their own scraper these days just uses Puppeteer. It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs. If I were to write a quick and dirty scraper today I would trivially make it through Anubis' protections... by doing literally nothing and without even realizing Anubis exists. Just using standard scraping practices with Puppeteer. Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Messages' link preview generator.

I'm investigating further, but I think this entire thing may have started due to some confusion, but want to see if I can actually confirm this before speculating further.

1. https://www.techspot.com/news/107407-wikipedia-servers-strug... (notice the clickbait title vs. the actual contents)

2. https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-cr...

3. https://codeberg.org/forgejo/discussions/issues/319#issuecom...

4. https://github.com/TecharoHQ/anubis/issues/964#issuecomment-...


> It is completely unrealistic to make a scraper that doesn't run JavaScript and wait for the page to "settle down" because so many pages, even blogs, are just entirely client-side rendered SPAs.

I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

> Meanwhile Anubis is absolutely blocking plenty of real humans, with the author for example telling people to turn on cookies so that Anubis can do its job [4]. I don't think Anubis is blocking anything other than humans and Message's link preview generator.

These are some of the legitimate problems with Anubis (and this is not the only way you can be blocked by it). Cloudflare can have similar problems, although it works a bit differently, so it's not exactly the same.


> I specifically want a search engine that does not run JavaScript, so that it only finds documents that do not require JavaScripts to display the text being searched. (This is not the same as excluding everything that has JavaScripts; some web pages use JavaScripts but can still display the text even without it.)

Sure... but off-topic, right? AI companies are desperate for high quality data, and unlike search scrapers, are actually not supremely time sensitive. That is to say, they don't benefit from picking up on changes seconds after they are published. They essentially take a "snapshot" and then do a training run. There is no "real-time updating" of an AI model. So they have all the time in the world to wait for a page to reach an ideal state, as well as all the incentive in the world to wait for that too. Since the data effectively gets "baked into the model" and then is static for the entire lifetime of the model, you over-index on getting the data, not on getting it fast, or cheap, or whatever.


Hi, main author of Anubis here. How am I meant to store state like "user passed a check" without cookies? Please advise.


If the rest of my post is accurate, that's not the actual concern, right? Since I'm not sure if the check itself is meaningful. From what is described in the documentation [1], I think the practical effect of this system is to block users running old mobile browsers or running browsers like Opera Mini in third world countries where data usage is still prohibitively expensive. Again, the off-the-shelf scraping tools [2] will be unaffected by any of this, since they're all built on top of Puppeteer, and additionally are designed to deal with the modern SPA web which is (depressingly) more or less isomorphic to a "proof-of-work".

If you are open to jumping on a call in the next week or two I'd love to discuss directly. Without going into a ton of detail, I originally started looking into this because the group I'm working with is exploring potentially funding a free CDN service for open source projects. Then this AI scraper stuff started popping up, and all of a sudden it looked like if these reports were true it might make such a project no longer economically realistic. So we started trying to collect data and concretely nail down what we'd be dealing with and what this "post-AI" traffic looks like.

As such, I think we're 100% aligned on our goals. I'm just trying to understand what's going on here since none of the second-order effects you'd expect from this sort of phenomenon seem to be present, and none of the places where we actually have direct data seem to show this taking place (and again, Cloudflare's data seems to also agree with this). But unless you already own a CDN, it's very hard to get a good sense of what's going on globally. So I am totally willing to believe this is happening, and am very incentivized to help if so.

EDIT: My email is my HN username at gmail.com if you want to schedule something.

1. https://anubis.techaro.lol/docs/design/how-anubis-works

2. https://apify.com/apify/puppeteer-scraper


Cloudflare Turnstile doesn't require cookies. It stores per-request "user passed a check" state using a query parameter. So disabling cookies will just cause you to get a challenge on every request, which is annoying but ultimately fair IMO.


Doesn't Wikipedia offer full tarballs?

This would imaginably put some downward pressure on scraper volume.


From the first paragraph in my comment:

> You'd see them taking some additional steps to combat this. They haven't. Their CDN handles it just fine. They don't even bother telling AI bots to just download the tarballs they specifically make available for this exact use case.

Yes, they do. But they aren't in a rush to tell AI companies this, because again, this is not actually a super meaningful amount of traffic increase for them.


I don't think you understand the purpose of Anubis. If you did then you'd realize that running a web browser with JS enabled doesn't bypass anything.


By bypass I mean "successfully pass the challenge". Yes, I also have to sit through the Anubis interstitial pages, so I promise I know it's not being "bypassed". (I'll update the post to remove future confusion).

Do you disagree that a trivial usage of an off-the-shelf Puppeteer scraper [1] has no problem doing the proof-of-work? As I mentioned in this comment [2], AI scrapers are not on some time crunch; they are happy to wait a second or two for the final content to load (there are plenty of normal pages that take longer than the Anubis proof of work does to complete), and also are unfazed by redirects. Again, these are issues you deal with in normal everyday scraping. And also, do you disagree with the traffic statistics from Cloudflare's site? If we're seeing anything close to that 18% increase then it would not seem to merit user-visible levels of mitigation. Even if it were 180% you wouldn't need to do this. nginx is not constantly on the verge of failing from a double-digit "traffic spike".
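For context, the proof-of-work in question is, per Anubis's design docs, a SHA-256 partial-preimage search: find a nonce so the hash of challenge-plus-nonce starts with enough zeros. A minimal sketch of such a solver (the challenge string and difficulty here are made up for illustration) shows why any client that can run code pays only a fraction of a second:

```python
import hashlib

def solve(challenge: str, difficulty: int) -> int:
    """Brute-force a nonce whose SHA-256 digest has `difficulty` leading zero hex digits."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

# Difficulty 4 means ~1 in 65,536 hashes succeeds: trivial for any CPU.
nonce = solve("example-challenge", 4)
```

The asymmetry the scheme relies on is cost at scale, not capability: a headless browser solves it exactly like a human's browser does.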

As I mentioned in my response to the Anubis author here [3], I don't want this to be misinterpreted as a "defense of AI scrapers" or something. Our goals are aligned. The response there goes into detail that my motivation is that a project I am working on will potentially not be possible if I am wrong and this AI scraper phenomenon is as described. I have every incentive in the world to just want to get to the bottom of this. Perhaps you're right, and I still don't understand the purpose of Anubis. I want to! Because currently neither the numbers nor the mitigations seem to line up.

BTW, my same request extends to you, if you have direct experience with this issue, I'd love to jump on a call to wrap my head around this.

My email is my HN username at gmail.com if you want to reach out, I'd greatly appreciate it!

1. https://apify.com/apify/puppeteer-scraper

2. https://news.ycombinator.com/item?id=44944761

3. https://news.ycombinator.com/item?id=44944886


Or major web properties for that matter.

