tomnicholas1's comments

tomnicholas1 · 2024-12-20T21:10:05 1734729005

No not necessarily - it will keep growing hotter until the black body radiation emitted by the probe matches the power of the radiation hitting the probe. Then it will stay at constant temperature.

It's a standard undergraduate problem to work out what this equilibrium temperature is for a flat plate at a distance from the sun equal to the Earth's orbital radius.

Interestingly the result is only a few 10's of degrees less than the average temperature of the real Earth - the difference is due to the Greenhouse Effect.

For the probe one could easily do the maths but I could believe that at 4 million miles that equilibrium temperature is 2,500F.

thisisbrians · 2024-12-22T03:00:20 1734836420

i definitely can't do the math myself, but excellent retort. thank you for the nuance

tomnicholas1 · 2024-12-20T13:31:13 1734701473

> It is sort of analogous to what GribStream is doing already.

The difference is presumably that you are doing some large rechunking operation on your server to hide from the user the fact that the data is actually in multiple files?

Cool project btw, would love to hear a little more about how it works underneath :)

ElPeque · 2024-12-20T13:46:44 1734702404

Yeah, exactly.

I basically scrape all the grib index files to know all the offsets into all variables for all time. I store that in clickhouse.

When the API gets a request for a time range, set of coordinates and a set of weather parameters, first I pre-compute the mapping of (lat,lon) into the 1 dimensional index in the gridded data. That is a constant across the whole dataset. Then I query the clickhouse table to find out all the files+offset that need to be processed and all of them are queued into a multi-processing pool. And then processing each parameter implies parsing a grib file. I wrote a grib2 parser from scratch in golang so as to extract the data in a streaming fashion. As in... I don't extract the whole grid only to lookup the coordinates in it. I already pre-computed the index, so I can just decode every value in the grid in order and when I hit and index that I'm looking for, I copy it to a fixed size buffer with the extracted data. When You have all the pre-computed indexes then you don't even need to finish downloading the file, I just drop the connection immediately.

It is pretty cool. It is running in very humble hardware so I'm hoping I'll get some traction so I can throw more money into it. It should scale pretty linearly.

I've tested doing multi-year requests and the golang program never goes over 80Mb of memory usage. The CPUs get pegged so that is the limiting factor.

Grib2 complex packing (what the NBM dataset uses) implies lots of bit-packing. So there is a ton more to optimize using SIMD instructions. I've been toying with it a bit but I don't want to mission creep into that yet (fascinating though!).

I'm tempted to port this https://github.com/fast-pack/simdcomp to native go ASM.

tomnicholas1 · 2024-12-20T19:38:13 1734723493

That's pretty cool! Quite specific to this file format/workload, but this is an important enough problem that people might well be interested in a tailored solution like this :)

ElPeque · 2024-12-21T02:12:53 1734747173

Thank you!

tomnicholas1 · 2024-12-20T13:29:25 1734701365

> Why is weather data stored in netcdf instead of tensors or sparse tensors?

NetCDF is a "tensor", at least in the sense of being a self-describing multi-dimensional array format. The bigger problem is that it's not a Cloud-Optimized format, which is why Zarr has become popular.

> Also, SQLite supports virtual tables that can be backed by Content Range requests

The multi-dimensional equivalent of this is "virtual Zarr". I made this library to create virtual Zarr stores pointing at archival data (e.g. netCDF and GRIB)

https://github.com/zarr-developers/VirtualiZarr

> xarray and thus NetCDF-style multidimensional arrays in WASM in the browser with HTTP Content Range requests to fetch and cache just the data requested

Pretty sure you can do this today already using Xarray and fsspec.

tomnicholas1 · 2024-11-17T21:20:29 1731878429

That's partly because the warming experienced over land can be ~50-100% larger than the globally-averaged warming, with the temperatures over the oceans increasing more slowly to make up the difference.

https://www.carbonbrief.org/guest-post-why-does-land-warm-up...

tomnicholas1 · 2024-10-20T13:11:58 1729429918

You should look at Icechunk. Your imaging data is structured (it's a multidimensional array), so it should be possible be to represent it as "Virtual Zarr". Then you could commit it to an Icechunk store.

https://earthmover.io/blog/icechunk

tomnicholas1 · 2024-10-20T13:04:12 1729429452

If you're wondering this you should look at Icechunk too, which was open-sourced just this week. It's Apache Iceberg but for multidimensional data (e.g. Zarr).

https://earthmover.io/blog/icechunk

https://news.ycombinator.com/item?id=41850352

tomnicholas1 · 2024-08-19T20:13:32 1724098412

So the equivalent of these balloons in oceanography are called ARGO floats, which similarly cannot be driven laterally but can control their own depth like a submarine. So far millions of timeseries have been collected across the world ocean using these floats.

https://argo.ucsd.edu/

One difference though is that the ARGO floats are unfortunately not recycled, and just wash up on various beaches. (I'm curious whether you think you can realistically collect many of these mini balloons?)

If you do want to control the lateral position of fleets of sensors, oceanographers also now have "gliders", which are basically small powered drone submarines. These are used by a few groups, but most of the gliders in the world are operated by the US Navy, who launch them out of torpedo tubes to survey local ocean conditions (which is badass).

https://oceanservice.noaa.gov/facts/ocean-gliders.html

The recorded measurements present an interesting data assimilation challenge - they record data along 3D trajectories (4D including time), sampling jagged and twisting lines through the 4D space. But we normally prefer to think of weather/ocean data as gridded, so you need to interpolate the trajectory data onto the grid, whilst keeping the result physically-consistent. Oceanographers use systems like ECCO for ocean state estimation, which effectively find the "ocean of best fit" to various data sources.

https://www.ecco-group.org/

Interestingly ECCO uses an auto-differentiable form of the governing equations for the ocean flow to ensure that updates stay physically consistent. This works by using a differentiable ocean fluid model called [MITgcm](https://github.com/MITgcm/MITgcm) to perform runs which match experimental data as closely as possible, and minimizing a loss function through gradient descent. The gradient is of a loss function (error) with respect to model input parameters + forcings, which is calculated by running MITgcm in adjoint mode - i.e. automatic differentation. Therefore this approach is sort of ML before it was cool (they were doing all this well before the new batch of AI weather models). See slides 9-18 of this deck for a nice explanation

https://firebasestorage.googleapis.com/v0/b/firescript-577a2...

The trajectory data is also interesting because it's sort of tabular, but also you often want to query it in an array-like 4D space. You could also call it a "ragged" array. We have nice open-source tools for gridded (non-ragged) arrays (e.g. xarray and zarr, and the pangeo.io project) but I think we could provide scientists with better tools for trajectory-like data in general. If that seems relevant to you I would love to chat.

P.S: Sorceror seems awesome, and I applaud you for working on something hard-tech & climate-tech!

tndl · 2024-08-19T20:38:36 1724099916

This is super interesting, I'd never come across ARGO before. Data assimilation is a similar problem for our data, and there currently exist systems for assimilating weather balloon observations into gridded reanalysis data (https://www2.mmm.ucar.edu/wrf/users/). One thing we believe, however, is that the reanalysis step in weather forecasting is unnecessary in the long term, and that future (ML) weather models will eventually opt to generate predictions based on un-assimilated raw data and will get better results in doing so.

That being said, trajectory-based data tooling could be super interesting to us. Let's definitely chat: austin@sorcerer.earth

And re: recovery, we're pretty confident we'll be able to recover the majority of our systems. Being in the air has the advantage that we can choose to 'beach' ourselves in a specific location, rather than the first place we run across land like with the buoys. At his previous company, Alex wrote a prediction engine able to get similar balloon systems to land in a predicted 1kmx1km zone for recovery

counters · 2024-08-19T21:17:33 1724102253

> One thing we believe, however, is that the reanalysis step in weather forecasting is unnecessary in the long term, and that future (ML) weather models will eventually opt to generate predictions based on un-assimilated raw data and will get better results in doing so.

The idea that we'll be able to run ML weather models using "raw" observations and skip or implicitly incorporate an assimilation is spot-on - there's been an enormous shift in the AI-weather community over the past year to acknowledge that this is coming, and very soon.

But... in your launch announcement you seem to imply that you're already using your data for building and running these types of models. Can you clarify how you're actually going to be using your data over the next 12-24 months while this next-generation AI approach matures? Are you just doing traditional assimilation with NWP?

Also, to the point about reanalysis - that's almost certainly not correct. There are massive avenues of scientific research which rely on a fully-assimilated and reconciled, corrected, consistent analysis of atmospheric conditions. AI models in the form of foundation models or embeddings might provide new pathways to build reanalysis products, but they are a vital and critical tool and will likely be so for the foreseeable future.

bigveech · 2024-08-19T22:33:48 1724106828

> There are massive avenues of scientific research which rely on a fully-assimilated and reconciled, corrected, consistent analysis of atmospheric conditions.

That’s a good point! In fact, the outputs for observation based foundational models will likely include a "reanalysis-like" step for the final output.

Regarding the next 6-12 months, we will be integrating our data with traditional NWP models and utilizing AI for forecasting. We've developed a compact AI model that can directly assimilate our "ground truth" data with reanalysis, specifically for use in AI forecasting models.

Once we have hundreds of systems deployed, we'll use the collected observations, combined with historical publicly available data, to train a foundational model that will directly predict specific variables based on raw observations.