It's great that they added Iceberg support, I guess, but it's a shame that they also removed S3 Select. S3 Select wasn't perfect. For instance, the performance was nowhere near as good as using DuckDB to scan a Parquet file, since Duck is smart, and S3 Select does a full table scan.
But S3 Select is way cheaper than the new Iceberg support. So if you only need to read one Parquet snapshot, with no need to do updates, then this change is not welcome.
Great article though, and I was pleased to see this at the end:
> We’ve invested in a collaboration with DuckDB to accelerate Iceberg support in Duck,
I think there's a narrow window, at least in some programming languages, when environment variables can be set at the start of a process. But since it's global shared state, it needs to be written at most once (0 or 1 times) and read many times. No libraries should set them. No frameworks should set them. Only application authors should, and it should be dead obvious to the entire team what the last responsible moment to write an environment variable is.
I am fairly certain that somewhere inside the polyhedron that satisfies those constraints, is a large subset that could be statically analyzed and proven sound. But I'm less certain if Rust could express it cleanly.
Your process can be started in a paused state by a debugger, have new libraries and threads injected into it, and then resumed before a single instruction of your own binary has been executed... and debuggers are far from the only thing that will inject code into your processes. If you're willing to handwave that, pre-main constructors, etc. away, you can write something like this easily enough:
    struct BeforeEnvFreeze(());
    struct AfterEnvFreeze(());

    impl BeforeEnvFreeze {
        pub fn new() -> Self { /* singleton check using a static AtomicBool or something */ Self(()) }
        pub fn freeze(self) -> AfterEnvFreeze { AfterEnvFreeze(()) }
        pub fn set_env(&self, ...) { ... }
    }

    impl AfterEnvFreeze {
        pub fn spawn_thread(&self, ...) { ... }
    }

    fn main() {
        let a = BeforeEnvFreeze::new();
        a.set_env(...);
        a.set_env(...);
        //b.spawn_thread(...); // not available
        let b = a.freeze(); // consumes `a`
        b.spawn_thread(...);
        //a.set_env(...); // not available
    }
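For anyone who wants to compile the idea, here is one way the elided parts might be filled in. It is a sketch under the same handwave, not a sound abstraction: the `TOKEN_TAKEN` flag, the `Option` return from `new`, the parameter types and the `APP_MODE` variable are all assumptions added for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread::{self, JoinHandle};

// Guards the singleton: flips to true the first time a token is handed out.
static TOKEN_TAKEN: AtomicBool = AtomicBool::new(false);

pub struct BeforeEnvFreeze(());
pub struct AfterEnvFreeze(());

impl BeforeEnvFreeze {
    /// Returns the token once; every later call returns `None`.
    pub fn new() -> Option<Self> {
        if TOKEN_TAKEN.swap(true, Ordering::SeqCst) {
            None
        } else {
            Some(Self(()))
        }
    }

    /// Writes one variable. Assumes (handwave!) that no other thread exists yet.
    #[allow(unused_unsafe)] // `set_var` is only `unsafe` from the 2024 edition on
    pub fn set_env(&self, key: &str, value: &str) {
        unsafe { std::env::set_var(key, value) }
    }

    /// Consumes the token, so `set_env` is unreachable afterwards.
    pub fn freeze(self) -> AfterEnvFreeze {
        AfterEnvFreeze(())
    }
}

impl AfterEnvFreeze {
    /// Threads are meant to be spawned only through the frozen token.
    pub fn spawn_thread<T, F>(&self, f: F) -> JoinHandle<T>
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        thread::spawn(f)
    }
}

fn main() {
    let before = BeforeEnvFreeze::new().expect("token already taken");
    before.set_env("APP_MODE", "demo"); // hypothetical variable name
    let after = before.freeze(); // `before` is moved; set_env no longer compiles
    let handle = after.spawn_thread(|| std::env::var("APP_MODE").ok());
    println!("{:?}", handle.join().unwrap());
}
```

Nothing here stops other code from calling `std::env::set_var` or `std::thread::spawn` directly, which is exactly the first exercise below.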
Exercises left to the reader:
• Banning access to the relevant bits of Rust's stdlib, libc, etc. as a means of escaping this "safe" abstraction
• Conning your lead developer into accepting your handwave
• Setting up the appropriate VCS alerts so you have a chance to NAK "helpful" "utility" pull requests that undermine your "protections"
And of course, this all remains a hackaround for POSIX design flaws - your engineering time might be better spent ensuring or enforcing your libc is "fixed" via intentional memory leaks per e.g. https://github.com/bminor/glibc/commit/7a61e7f557a97ab597d6f... , which may ≈fix more than your Rust programs.
I agree that libraries certainly should not. But why would writing be the right choice ever, even for applications? Doesn't it make far more sense to use env to create some better-typed global configuration object, filling any gaps with defaults, then use that?
I'd go further and say env should always be read-only and libraries should never even read env vars.
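A rough sketch of what such a typed, read-once configuration object could look like in Rust. The `Config` fields, the `APP_PORT` / `APP_LOG_LEVEL` names and the defaults are made up for illustration; the point is only that env is read once and never written.

```rust
use std::sync::OnceLock;

/// Hypothetical application settings; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub port: u16,
    pub log_level: String,
}

impl Config {
    /// Builds a config from any key -> value lookup, using defaults for
    /// missing or unparsable entries.
    pub fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        Config {
            port: lookup("APP_PORT")
                .and_then(|v| v.parse().ok())
                .unwrap_or(8080),
            log_level: lookup("APP_LOG_LEVEL").unwrap_or_else(|| "info".to_string()),
        }
    }
}

static CONFIG: OnceLock<Config> = OnceLock::new();

/// Reads the process environment at most once; later calls reuse the result.
pub fn config() -> &'static Config {
    CONFIG.get_or_init(|| Config::from_lookup(|key| std::env::var(key).ok()))
}

fn main() {
    println!("{:?}", config());
}
```

Taking a lookup closure instead of calling `std::env::var` directly keeps the parsing testable without touching the real environment.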
> I think there's a narrow window, at least in some programming languages, when environment variables can be set at the start of a process.
I mean, based on this issue I would say the only safe time is "at the start of the program, before any new threads may have been created".
But again, as others have said, there's no good reason I'm aware of to set environment variables in your own process, and when you spawn a new process you can give it its own environment with any changes you want.
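In Rust, for example, that per-child environment goes through `std::process::Command`. A minimal sketch, where the program name and the variable names are placeholders:

```rust
use std::process::Command;

/// Builds a command whose environment differs from ours; nothing in the
/// parent process is modified. Program and variable names are hypothetical.
pub fn child_command(program: &str) -> Command {
    let mut cmd = Command::new(program);
    cmd.env("APP_MODE", "child") // add or override for the child only
        .env_remove("SECRET_TOKEN"); // do not pass this one on
    cmd
}

fn main() {
    let cmd = child_command("printenv");
    // Inspect the planned changes without spawning anything.
    for (key, value) in cmd.get_envs() {
        println!("{:?} => {:?}", key, value);
    }
    // `cmd.status()` or `cmd.output()` would actually launch the child.
}
```

`get_envs` only reports the explicit overrides and removals; the rest of the environment is inherited by the child when it is spawned.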
When using C++ I wanted programs to have a function that was called before main() and set up things that got sealed afterwards, like parsing command-line-arguments, the environment variables, loading runtime libraries, and maybe look at the local directory, but I'm not sure if it'll be a useful and meaningful distinction unless you restructure way too many things.
I remember that on the Fuchsia kernel programs needed to drop capabilities at some point, but the shift needed might be a hard sell given things already "work fine".
Everyone thinks they can be the first to do something, and that there is surely nothing that will happen before them. Unfortunately everyone save for one is mistaken. Sometimes that chosen one is not even consistent.
This is one of the problems with Singletons. Especially if they end up interacting or being composed.
In Java you’d have the static initializers run before the main method starts. And in some languages that spreads to the imports which is usually where you get into these chicken and egg problems.
One of the solutions here is make the entry point small, and make 100% of bootstrapping explicit.
Which is to say: move everything into the main method.
I’ve seen that work. On the last project it got a little big, and I went in to straighten out some bits and reduce it. But at the end anyone could read for themselves the initialization sequence, without needing any esoteric knowledge.
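To make "everything in the main method" concrete, here is a toy version in Rust. The `Settings`, `Logger` and `Service` types are invented for the example; the only claim is that when each piece takes its dependencies as arguments, the initialization order is readable from top to bottom and enforced by the compiler.

```rust
/// Hypothetical subsystems; each takes what it depends on as an argument
/// instead of reaching for a global.
pub struct Settings {
    pub service_name: String,
}

pub struct Logger {
    prefix: String,
}

pub struct Service<'a> {
    logger: &'a Logger,
}

impl Settings {
    /// Uses the first command-line argument as the name, else a default.
    pub fn load(args: &[String]) -> Self {
        let service_name = args.get(1).cloned().unwrap_or_else(|| "demo".to_string());
        Settings { service_name }
    }
}

impl Logger {
    pub fn new(settings: &Settings) -> Self {
        Logger { prefix: format!("[{}]", settings.service_name) }
    }

    pub fn line(&self, msg: &str) -> String {
        format!("{} {}", self.prefix, msg)
    }
}

impl<'a> Service<'a> {
    pub fn new(logger: &'a Logger) -> Self {
        Service { logger }
    }

    pub fn run(&self) -> String {
        self.logger.line("service started")
    }
}

fn main() {
    // The whole initialization sequence, top to bottom, in one place.
    let args: Vec<String> = std::env::args().collect();
    let settings = Settings::load(&args);
    let logger = Logger::new(&settings);
    let service = Service::new(&logger);
    println!("{}", service.run());
}
```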
I know I can fool around with crt0, but I'm not sure how much you can really use that if you plan to use libraries that may depend on global `static` things that get created as they are linked in before `main` starts.
Maybe it's possible, but if I need to review every library (and hope they don't break my assumptions later) I think I've lost on building this separation in a practical way.
You needn't go "hacky" for this; constructors for global/static variables are called before main(). But then, the underlying linker support is usually "trivially exposed" (using the constructor attribute in gcc/clang, say).
This (obviously?) isn't "110%" perfect as the order of the constructor calls for several such objects may not be well-defined, and were they to create threads (who am I to suggest being reasonable ...) you end up with chicken-egg situations again.
JavaScript only just got top level async. So what I saw happen is that files that do their own background tasks start those either in their constructor or lazily in the case of static functions.
There was one place and only one place where we violated that, and it was in code I worked on. It was a low level module used everywhere else for bootstrapping, and so we collectively decided to do something sneaky in order to avoid making the entire code base async.
And while I find that most of the time people can handle making one special case for a rule, it was a complicated system and even “we” screwed it up occasionally for a good long while.
The problem was we needed to make a consul call at startup and the library didn’t have a synchronous way to make that call. So all bootstrapping code had to call a function and await it, before loading other things that used that module. At the end we had about a dozen entry points (services, dev and diagnostic tools). And I always got blamed because nobody seemed to remember we decided this together.
I hate singletons. And I ended up with one of only two in the whole project, and that hatred still wasn’t enough to prevent hitting the classical problems with singletons.
That does happen. Still, there is a reason many avoid it. Probably every significant project has places where they do that. Still, if it isn't in main it is always a little "magic", and that makes it hard to understand how the program works (or worse, it randomly doesn't work because something is used before it is initialized).
> When using C++ I wanted programs to have a function that was called before main() and set up things that got sealed afterwards, like parsing command-line-arguments, the environment variables, loading runtime libraries, and maybe look at the local directory, but I'm not sure if it'll be a useful and meaningful distinction unless you restructure way too many things
If you're only reading environment variables you have no problem, though. It's only if you try to change them that it causes issues.
For setting, "only set environment variables in the Bash script that starts your program" might be a good rule.
The "cross platform" way of setting the environment is to set it "from outside" of the program - meaning, through the executor, whether that's the shell or the container runtime or even the kernel command line if you insist on rewriting init in rust/go/zig/...
It can be as easy as spawning your process via "env -i VAR1=... ... myprogram ..." - and given this also clears the dangers of env-insertion exploits, it's good practice.
(The argument that the horses have long bolted with respect to "just do the right thing ok?!" here holds some water. I'm of the generation though where people on the internet could still tell each other they were wrong, and I assert that here; you're wrong if you believe a non-threadsafe unix interface is a bug. No matter what kind of restrictions around its use that means. You're still wrong if you assume the existence of such restrictions is a bug.)
Some of the docker containers I made ended up having a bash shell as the entry point and I moved most of the environment variable init out of the code and into the script. But in dev sandbox some of that code runs without the script, so it was still a headache.
>Note that Java, and the JVM, doesn't allow changing environment variables. It was the right choice, even if painful at times.
Not sure why it would be considered painful. IMO, using setenv to modify your own variables is a mistake: setenv is thread-unsafe by definition. So unless you're running a single-threaded application, it never makes sense to call it.
Java does support running child processes with a designated env space (ProcessBuilder.environment is a modifiable map, copied from the current process), so the inability to modify its own doesn't matter.
Personally I have never needed to change env variables. I consider them the same as the command line parameters.
> Java doesn't even allow to change the working directory also due to potential multi-threading problems.
Linux and macOS both support per-thread working directory, although sadly through incompatible APIs.
Also, AFAIK, the Linux API can't restore the link between the process CWD and thread CWD once broken – you can change your thread's CWD back to the process CWD, but that thread won't pick up any future changes to the process CWD. By contrast, macOS has an API call to restore that link.
That would be so much wasted engineering effort. The actual solution is simple: read what you need from env, and pass it as parameters to the functions you want to. The values of what you have read can be changed... and if you really, really want to, start a child process with a modified env.
If you really wish, you can change the bootstrap path and allow changing env() for whatever reason you want to (likely via copy on write). If you don't wish to do that, feel free to spawn a child process with whatever env you desire, then redirect/join sys in/out/err (0/1/2).
Those are trivial things in around 100 lines of code and have been available since System.getenv() came back (it used to be deprecated and non-functional prior to Java 1.5, i.e. 2004).
So the story goes (I'm probably just regurgitating some marketing)...
The founder has a software background, and built out a software service to track energy usage and deliver data-driven pricing plans. But when he tried to sell it to the energy companies, he didn't find much demand for the software. So they set up their own energy supplier to prove how good it was. They now sell their software in multiple countries.
As a customer, not only can I get an API key and retrieve usage data, but I can also use an API to see what other pricing plans are available. And in the mobile app I get up-to-the-minute usage reporting.
I've also experimented with DuckDB whilst on a Databricks project, and did also think "we could do this whole thing with DuckDB and a large EC2 instance spun up for a few hours a week".
But of course DuckDB was new then, and you can't re-architect on a hunch. Thanks for the article.
The issue with maritime insurance is that ocean-going vessels spend a great deal of time in countries other than the one where the owner registers the ship, and also time in international waters, which are not governed by the laws of any country.
So your average insurance company is not interested in providing insurance.
If you want to read more about the history of maritime insurance, you could start with the history of Lloyd's of London.
Sometimes it is more important to work on proving you have a viable product and market to sell it in before you optimise.
On the outside we can’t be sure. But it’s possible that they took the right decision to go with a naïve implementation first. Then profile, measure and improve later.
But yes, the whole idea of running a headless web browser to run JavaScript to get access to a video stream is a bit crazy. But I guess that’s just the world we are in.