Author here. Some ideas I've been thinking about:
- An open source data pipeline built on top of R2: keep the data on R2/S3 but have execution handled in Workers/Lambda. Inspired by what https://www.boilingdata.com/ and https://www.bauplanlabs.com/ are doing (minimal sketch after this list).
- Related to the above: take data stored in the various big-data formats (Parquet, Iceberg, Hudi, etc.), generate many more copies/layouts of the datasets, and choose the optimal one based on the workload. You can do this with existing providers, but I think the cost element just makes it easier to stomach here (toy example below).
- Abstracting over some of the AI/ML products out there and choosing the best one for the job: keep the data on R2 and ship it to the relevant provider for specific tasks, since data ingress to them is free (routing sketch below).
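
For the first idea, here's a minimal sketch of the execution side, assuming DuckDB's httpfs extension pointed at R2's S3-compatible endpoint; the account ID, bucket, and credentials are placeholders, and the Lambda-style handler is just to show where the compute would live:

    import duckdb

    # One-time setup: DuckDB reads Parquet straight off R2 over the S3 API.
    con = duckdb.connect()
    con.execute("INSTALL httpfs; LOAD httpfs;")
    con.execute("SET s3_endpoint='<R2_ACCOUNT_ID>.r2.cloudflarestorage.com';")
    con.execute("SET s3_region='auto';")       # R2 accepts 'auto' as the region
    con.execute("SET s3_url_style='path';")
    con.execute("SET s3_access_key_id='<KEY>'; SET s3_secret_access_key='<SECRET>';")

    def handler(event, context):
        # Lambda/Worker-style entry point: compute is stateless and pay-per-use;
        # the data never leaves R2 except as query results.
        sql = event["sql"]  # e.g. "SELECT count(*) FROM read_parquet('s3://analytics/events/*.parquet')"
        return con.execute(sql).fetchall()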
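For the second, a toy version of the "generate layouts, pick the cheapest" loop using pyarrow. The table, candidate layouts, and predicate are made up, and bytes-remaining-after-partition-pruning stands in for a real cost model:

    import os
    import pyarrow as pa
    import pyarrow.dataset as ds
    import pyarrow.parquet as pq

    # Made-up table standing in for a dataset already sitting on R2 as Parquet.
    table = pa.table({
        "region": ["eu", "us", "eu", "us"] * 250,
        "day":    [1, 1, 2, 2] * 250,
        "value":  list(range(1000)),
    })

    # Candidate physical layouts: same data, different partition columns.
    LAYOUTS = {
        "by_day":        ["day"],
        "by_region":     ["region"],
        "by_region_day": ["region", "day"],
    }
    for name, cols in LAYOUTS.items():
        pq.write_to_dataset(table, root_path=f"layouts/{name}", partition_cols=cols)

    # Crude cost model: bytes left to scan after partition pruning for a
    # representative predicate from the workload.
    predicate = ds.field("region") == "eu"
    costs = {}
    for name in LAYOUTS:
        dataset = ds.dataset(f"layouts/{name}", partitioning="hive")
        fragments = dataset.get_fragments(filter=predicate)
        costs[name] = sum(os.path.getsize(f.path) for f in fragments)

    print(min(costs, key=costs.get), costs)  # cheapest layout for this workload

Cheap storage is what makes this viable: keeping three copies of the data is normally hard to justify, but if the redundant layouts cut the bytes scanned per query, the storage overhead can pay for itself.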
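And for the third, the routing layer could start as little more than a dict from task to provider adapters. The boto3/R2 plumbing is real; everything else (adapter functions, the selection policy) is a hypothetical placeholder:

    import boto3

    # R2 speaks the S3 API, so boto3 works with a custom endpoint.
    r2 = boto3.client(
        "s3",
        endpoint_url="https://<R2_ACCOUNT_ID>.r2.cloudflarestorage.com",
        aws_access_key_id="<KEY>",
        aws_secret_access_key="<SECRET>",
    )

    # Hypothetical adapters, one per vendor per task; each would wrap a real API.
    def provider_a_transcribe(blob: bytes) -> str: ...
    def provider_b_transcribe(blob: bytes) -> str: ...

    ROUTES = {
        # task -> candidates, ordered by whatever policy you like (price, quality, latency)
        "transcribe": [provider_a_transcribe, provider_b_transcribe],
    }

    def run(task: str, bucket: str, key: str):
        # The object stays on R2 until the moment of use. R2 charges no egress
        # and the AI providers charge no ingress, so shopping around is cheap.
        blob = r2.get_object(Bucket=bucket, Key=key)["Body"].read()
        provider = ROUTES[task][0]  # stand-in for a real selection policy
        return provider(blob)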