zbentley's comments

> If you just blindly write the event, then you always know that fact about customer 1234 in any future query.

Unless the write times out or the DB is down for maintenance when the event arrives.

Sure, you could block acknowledgement of the event until the event log receives it, but can your DB handle synchronous write volume from however many people are out there sending events? If your RPC servers listening for events are unavailable, do you trust event senders to retry when they're back?

Down that road lies "let's put every event in a fast message bus with higher insert volume and availability than the database, and feed that into the DB asynchronously", hence Kafka and friends.
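A minimal sketch of that shape, assuming the kafka-python client and a hypothetical insert_event() helper for the eventual database write:

    # Producer side: hand the event to the bus; the DB write happens later.
    import json
    from kafka import KafkaProducer  # assumption: kafka-python client

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def handle_incoming_event(event):  # hypothetical RPC/webhook handler
        producer.send("customer-events", event)

    # Consumer side: drain the topic into the DB at whatever rate it can absorb.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "customer-events",
        bootstrap_servers="localhost:9092",
        group_id="db-writer",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        insert_event(message.value)  # hypothetical DB write; batch/retry as needed

The bus absorbs bursts and sender-side outages; the consumer catches up whenever the database is healthy again.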


> Down that road lies "let's put every event in a fast message bus with higher insert volume and availability than the database, and feed that into the DB asynchronously", hence Kafka and friends.

Yes, this is the way. Which is why it's not

>> Isn't this essentially how a modern transactional database works anyway

and why it needed a comment.


Debezium is popular in this space, though it does bring more tools into the CDC/CQRS stack: https://debezium.io/

I've worked on (and variously built and ripped out) systems like that, and I end up in the "more trouble than it's worth" camp here. Context-ish things do have considerable benefits, but the costs are also major.

If context isn't uniform and minimal, and people can add/remove fields for their own purposes, the context becomes a really sneaky point of coupling.

Adapting context-ful code from a request-response world to (for example) a parallel-batch-job world or continuous stream consumer world runs into friction: a given organization's idioms around context usually started out in one of those worlds, and don't translate well to others. If I'm a worker thread in a batch job working on a batch of "move records between tenant A and tenant B" work, but the business logic methods I'm calling to retrieve and store records are sensitive to a context field that assumes it'll be set in a web request (and that each web request will be made for exactly one tenant), what do I do? If your business is always going to be 99% request/response code, then sure, hack around the parts that aren't. But if your business does any continuous data pipeline wrangling, you rapidly end up with either a split codebase (request-response contextful vs "things that are only meant to be called from non-request-response code") or really thorny debugging around context issues in non-request-response code.

If you choose to deal with context thread-locally (or coroutine locally, or something that claims to be both but is in reality neither--looking at you, "contextvars"), that sneaky context mutation by the concurrency system multiplies the difficulties in reasoning about context behavior.
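A toy sketch of the batch-job friction described above, using Python contextvars (all names hypothetical):

    import contextvars

    # Set by web middleware; assumes exactly one tenant per request.
    current_tenant = contextvars.ContextVar("current_tenant")

    def load_record(record_id):
        # "Clean" business helper that quietly depends on ambient context.
        tenant = current_tenant.get()  # raises LookupError outside a request
        return db_fetch(tenant, record_id)  # hypothetical data-access call

    def move_records(batch):
        # Batch worker moving records from tenant A to tenant B: which tenant
        # should the ambient context hold? Neither answer is right, and the
        # failure mode (LookupError, or worse, a stale tenant) only shows up
        # at runtime, far from the code that set the context.
        for record_id in batch:
            record = load_record(record_id)
            ...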

> it violates various architectural principles, for example, from the point of view of our business logic, there's no such thing as "tenant ID"

I think a lot of people lose sight of how incredibly useful explicit dependency management is because it's classed as "tight coupling" and "bad architecture" when it's nothing of the sort. I blame 2010s Java and dependency inversion/injection brainrot.

Business logic is rarely pure; most "business" code functions as transforming glue between I/O. The behavior of the business logic is fundamentally linked to _where_ (and often _how_ as well--e.g. is it in a database transaction?) it interacts with datastores and external services. "Read/write business code as if it didn't have side effects" is not a good approach if code is _primarily occupied with causing side effects_--and, in commercial software engineering, most of it is!

From that perspective, explicitly passing I/O system handles, settings, or whatnot everywhere can be a very good thing: when reading complex business logic, the presence (or absence) of those dependencies in a function call tells you what parts of the system will (or can) conduct I/O. That provides at-a-glance information into where the system can fail, where it can lag, what services or mocks need to be running to test a given piece of code, and at a high level what data flows it models (e.g. if a big business logic function receives an HTTP client factory for "s3.amazonaws.com/..." and a database handle, it's a safe bet that the code in question broadly moves data between S3 and the database).

While repetitive, doing this massively raises the chance of catching certain mistakes early. For example, say you're working on a complex businessy codebase and you see a long for-loop around a function call like "process_record(record, database_tenant_id, use_read_replica=True, timeout=5)". That's a strong hint that there's an N+1 query/IO risk in that code, and the requirement that I/O system dependencies be passed around explicitly encodes that hint _semantically_.

That kind of visibility is vastly superior to "pure" and uncluttered business logic that relies on context/lexicals to plumb IO around. Is the pure code less noisy and easier to interpret? Sure, but the results of that interpretation are so much less valuable as to be actively misleading.

Put another way: business logic is concerned with things like tenant IDs and database connections; obscuring those dependencies is harmful. Separation of concerns means that good business code is code that avoids mutating, or making decisions based on, the dependencies it receives--not that it doesn't receive them/use them/pass them around.
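To make the process_record example above concrete, here's a hedged sketch (all names and signatures hypothetical) of how the explicit handle both flags the N+1 risk and suggests the batched fix:

    # The explicit db handle in the signature is the tell: this call does I/O
    # once per record, so the loop below is a likely N+1 hotspot.
    def process_record(db, record, tenant_id, use_read_replica=True, timeout=5):
        row = db.fetch_one(tenant_id, record.key,
                           replica=use_read_replica, timeout=timeout)
        return transform(record, row)  # hypothetical pure transformation

    results = []
    for record in records:
        results.append(process_record(db, record, tenant_id))

    # The same visibility suggests the fix: pass the handle once, query once.
    def process_records(db, records, tenant_id, timeout=5):
        rows = db.fetch_many(tenant_id, [r.key for r in records], timeout=timeout)
        return [transform(r, rows[r.key]) for r in records]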


I think this isn't shoddy engineering; I think you disagree with a design tradeoff that was consciously made here.

If CLIs are designed around facilitating the easiest means of editing the most commonly-edited word, different programs will end up with semantically very different CLIs. In some programs, the "noun" (e.g. file path) will be the thing most frequently edited. In some others, like systemctl, it'll be the "verb", like start/stop. That means that different programs designed around this principle will be semantically extremely inconsistent with each other.

On the other hand, if consistency of "base_command sub_command --subcommand-arg sub_sub_command --option argument" is taken as the guiding principle, many different tools will act semantically similarly. This enables unfamiliar users (think first-time users, or people who have just opened the manpage for the first time) to intuit the right way to present commands to those tools, rather than expecting them to memorize (or re-look-up as needed) each one's specific semantics.

While there's merit to both, I think the second one--systemd's CLI approach--is better. Put another way: user-centric design and consistency with other applications are sometimes at odds in this way. A tool that is hyper-focused on optimizing the most common tasks for its (power) users risks oddball status and hostility to new users. It's important to pick a point on the spectrum that appropriately balances those two concerns. Python ("There should be one-- and preferably only one --obvious way to do it") and the Apple HIG ecosystem both understand the truth of this: that it is sometimes necessary to trade away efficiency for familiarity/consistency. There's a reason Perl languished while Python grew.

Like, I get it. I've been a sysadmin/devops for decades, and the paper cuts add up. But it's easy to forget the paper cuts I'm not getting: modern tools are (generally, with exceptions) more consistent; there are fewer commands that are resistant to memorization and need to be looked up each time; fewer commands that lead to questions like "wait, is it 'cmd help subcommand', 'cmd --help subcommand', or 'cmd subcommand --help'? Am I going to have to Google how to get the help output of this command?"


Because I switch computers (often, for work), and what's muscle memory on one then becomes "command not found" on others without the alias. Many of those computers I don't control and can't say "well, everyone should just run my aliases".

Because I have to share commands with other people who are troubleshooting their own systems, and copy/paste from history becomes useless if I have specific aliases.

Because someday I or someone will want to script these interactions, and aliases are not available in subprocesses.


I use ZFS for boot and storage volumes on my main workstation, which is primarily that--a workstation, not a server or NAS. Some benefits:

- Excellent filesystem level backup facility. I can transfer snapshots to a spare drive, or send/receive to a remote (at present a spare computer, but rsync.net looks better every year I have to fix up the spare).

- Unlike other fs-level backup solutions, the flexibility of zvols means I can easily expand or shrink the scope of what's backed up.

- It's incredibly easy to test (and restore) backups. Pointing my to-be-backed-up volume, or my backup volume, to a previous backup snapshot is instant, and provides a complete view of the filesystem at that point in time. No "which files do you want to restore" hassles or any of that, and then I can re-point back to latest and keep stacking backups. Only Time Machine has even approached that level of simplicity in my experience, and I have tried a lot of backup tools. In general, backup tools/workflows that uphold "the test process is the restoration process, so we made the restoration process as easy and reversible as possible" are the best ones.

- Dedup occasionally comes in useful (if e.g. I'm messing around with copies of really large AI training datasets or many terabytes of media file organization work). It's RAM-expensive, yes, but what's often not mentioned is that you can turn it on and off for a volume--if you rewrite data. So if I'm looking ahead to a week of high-volume file wrangling, I can turn dedup on where I need it, start a snapshot-and-immediately-restore of my data (or if it's not that many files, just cp them back and forth), and by the next day or so it'll be ready. Turning it off when I'm done is even simpler. I imagine that the copy cost and unpredictable memory usage mean that this kind of "toggled" approach to dedup isn't that useful for folks driving servers with ZFS, but it's outstanding on a workstation.

- Using ZFSBootMenu outside of my OS means I can be extremely cavalier with my boot volume. Not sure if an experimental kernel upgrade is going to wreck my graphics driver? Take a snapshot and try it! Not sure if a curl | bash invocation from the internet is going to rm -rf /? Take a snapshot and try it! If my boot volume gets ruined, I can roll it back to a snapshot in the bootloader from outside of the OS. For extra paranoia I have a ZFSBootMenu EFI partition on a USB drive if I ever wreck the bootloader as well, but the odds are that if I ever break the system that badly the boot volume is damaged at the block level and can't restore local snapshots. In that case, I'd plug in the USB drive and restore a snapshot from the adjacent data volume, or my backup volume ... all without installing an OS or leaving the bootloader. The benefits of this to mental health are huge; I can tend towards a more "college me" approach to trying random shit from StackOverflow for tweaking my system without having to worry about "adult professional me" being concerned that I don't know what running some random garbage will do to my system. Being able to experiment first, and then learn what's really going on once I find what works, is very relieving and makes tinkering a much less fraught endeavor. (A tiny snapshot-and-rollback helper for this try-it-and-see workflow is sketched after this list.)

- Being able to per-dataset enable/disable ARC and ZIL means that I can selectively make some actions really fast. My Steam games, for example, are in a high-ARC-bias dataset that starts prewarming (with throttled IO) in the background on boot. Game load times are extremely fast--sometimes at better than single-ext4-SSD levels--and I'm storing all my game installs on spinning rust for $35 (4x 500GB + 2x 32GB cheap SSD for cache)!
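For the snapshot-then-experiment loop mentioned above, the workflow is small enough to wrap in a few lines. A hedged sketch using the standard zfs snapshot/rollback commands via Python's subprocess (the dataset name is a hypothetical placeholder):

    import datetime
    import subprocess

    def snapshot(dataset="zroot/ROOT/default"):  # hypothetical boot dataset
        name = f"{dataset}@pre-experiment-{datetime.datetime.now():%Y%m%dT%H%M%S}"
        subprocess.run(["zfs", "snapshot", name], check=True)
        return name

    def rollback(snap):
        # -r discards any snapshots taken after the one being rolled back to.
        subprocess.run(["zfs", "rollback", "-r", snap], check=True)

    snap = snapshot()
    # ...run the sketchy kernel upgrade / curl | bash experiment...
    # rollback(snap) if it went badly; otherwise keep the changes and move on.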


It's great to hear that you're using ZFSBootMenu the way I envisioned it! There's such a sense of relief and freedom having snapshots of your whole OS taken every 15 minutes.

One thing that you might not be aware of is that you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab. Keep in mind though that you can only have one checkpoint at a time, they keep growing and growing, and a rollback is for EVERYTHING on the pool.


Oh, are you zdykstra? If so, thanks for creating an invaluable tool!

> you can create a zpool checkpoint before doing something 'dangerous' (disk swap, pool version upgrade, etc) and if it goes badly, roll back to that checkpoint in ZFSBootMenu on the Pool tab

Good to know! Snapshots meet most of my needs at present (since my boot volume is a single fast drive, snapshots ~~ checkpoints in this case), but I could see this coming in useful for future scenarios where I need to do complex or risky things with data volumes or SAN layout changes.


This article is a good primer on why process isolation is more robust/separated than threads/coroutines in general, though ironically I don't think it fully justifies why process isolation is better for tests as a specific use case benefitting from that isolation.

For tests specifically, some considerations I found to be missing:

- Given speed and representativeness requirements for tests, it's often beneficial to refrain from too much isolation so that multiple tests can use/exercise paths that use pre-primed in-memory state (caches, open sockets, etc.). It's odd that the article calls out isolation from global-ish state mutation as a specific benefit of process isolation, given that it's often substantially faster and more representative of real production environments to run tests in the presence of already-primed global state. Other commenters have pointed this out.

- I wish the article were clearer about threads as an alternative isolation mechanism for sequential tests versus threads as a means of parallelizing tests. If tests really do need to be run in parallel, processes are indeed the way to go in many cases, since thread-parallel tests often test a more stringent requirement than production would. Consider, for example, a global connection pool which is primed sequentially on webserver start, before the webserver begins (maybe parallel) request servicing. That setup code doesn't need to be thread-safe, so using threads to test it in parallel may surface concurrency issues that are not realistic.

- On the other hand, enough benefits are lost when running clean-slate test-per-process that it's sometimes more appropriate to have the test harness orchestrate a series of parallel executors and schedule multiple tests to each one (see the sketch after this list). Many testing frameworks support this on other platforms; I'm not as sure about Rust--my testing needs tend to be very simple (and, shamefully, my coverage of fragile code is lower than it should be), so take this with a grain of salt.

- Many testing scenarios want to abort testing on the first failure, in which case processes vs. threads is largely moot. If you run your tests with a thread or otherwise-backgrounded routine that can observe a timeout, it doesn't matter whether your test harness can reliably kill the test and keep going; aborting the entire test harness (including all processes/threads involved) is sufficient in those cases.

- Debugging tools are often friendlier to in-process test code. It's usually possible to get debuggers to understand process-based test harnesses, but this isn't usually set up by default. If you want to breakpoint/debug during testing, running your tests in-process and on the main thread (with a background thread aborting the harness or auto-starting a debugger on timeout) is generally the most debugger-friendly practice. This is true on most platforms, not just Rust.

- fork() is a middle ground here as well: it can be slow (though mitigations exist), but it can also speed things up considerably by sharing e.g. primed in-memory caches and socket state with tests when they run. Given fork()'s sharp edges re: filehandle sharing, this, too, works best with sequential rather than parallel test execution. Depending on the libraries in use in code-under-test, though, this is often more trouble than it's worth. Dealing with a mixture of fork-aware and fork-unaware code is miserable; better to do as the article suggests if you find yourself in that situation. How to set up library/reusable code to hit the right balance between fork-awareness/fork-safety and environment-agnosticism is a big and complicated question with no easy answers (and also excludes the easy rejoinder of "fork is obsolete/bad/harmful; don't bother supporting it and don't use it, just read Baumann et al.!").

- In many ways, this article makes a good case for something it doesn't explicitly mention: a means of annotating/interrogating in-memory global state, like caches/lazy_static/connections, used by code under test. With such an annotation, it's relatively easy to let invocations of the test harness choose how they want to work: reuse a process for testing and re-set global state before each test, have the harness itself (rather than tests by side-effect) set up the global state, run each test with and/or without pre-primed global state and see if behavior differs, etc. Annotating such global state interactions isn't trivial, though, if third-party code is in the mix. A robust combination of annotations in first-party code and a clear place to manually observe/prime/reset-if-possible state that isn't annotated is a good harness feature to strive for. Even if you don't get 100% of the way there, incremental progress in this direction yields considerable rewards.
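As a rough illustration of the reused-executor point above (Python, with a hypothetical TEST_REGISTRY mapping test names to callables), a small pool of worker processes can be shared across many tests instead of paying for one process per test:

    import concurrent.futures
    import traceback

    def _run_one(test_name):
        # Workers are reused across tests, so module-level state (primed caches,
        # pooled connections) persists between tests in the same worker while
        # remaining isolated from the parent and from other workers.
        test_fn = TEST_REGISTRY[test_name]  # hypothetical name -> callable map
        try:
            test_fn()
            return test_name, True, ""
        except Exception:
            return test_name, False, traceback.format_exc()

    def run_suite(test_names, workers=4):
        failures = 0
        with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as pool:
            for name, ok, err in pool.map(_run_one, test_names):
                print(f"{'PASS' if ok else 'FAIL'} {name}")
                if not ok:
                    failures += 1
                    print(err)
        return failures

Whether the state carried between tests in the same worker is a feature or a hazard is exactly the tradeoff described above.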


But that's exactly it, right--everything you've listed is a valid point in the design space, but those points require a lot of coordination between various actors in the system. Process per test solves a whole swath of coordination issues.

The post lists out what it would take to make most of nextest's feature set available in a shared-process model. There has been some interest in this, but it is a lot of work for things that come for free with the process-per-test model.


Rust is one of my favorite new languages, but this is just wrong.

> few implicit casts

Just because it doesn't (often) implicitly convert/pun raw types doesn't mean it has "few implicit casts". Rust has large amounts of implicit conversion behavior (e.g. deref coercion, implicit into), and semi-implicit behavior (e.g. even a regular explicit ".into()" distances the conversion behavior from the target type in the code). The affordances offered by these features are significant--I like using them in many cases--but it's not exactly turning over a new leaf re: explicitness.

Without good editor support for e.g. figuring out which "into" implementation is being called by a "return x.into()" statement, working in large and unfamiliar Rust codebases can be just as much of a chore as rawdogging C++ in no-plugins vim.

Like so many Rust features, it's not breaking with specific semantics available in prior languages in its niche (C++); rather, it's providing the same or similar semantics in a much more consciously designed and user-focused way.

> lifetimes

How do lifetimes help (or interact with) IDE-less coding friendliness? These seem orthogonal to me.

Lastly, I think Rust macros are the best pro-IDE argument here. Compared to C/C++, the lower effort required (and higher quality of tooling available) to quickly expand or parse Rust macros means that IDE support for macro-heavy code tends to be much better, and much better out of the box without editor customization, in Rust. That's not an endorsement of macro-everything-all-the-time, just an observation re: IDE support.


Have you actually tried coding Rust without IDE support? I have. I code C and Rust professionally with basically only syntax highlighting.

As for how lifetimes help: one of the more annoying parts of coding C is constantly having to look up who owns a returned pointer. Should it be freed or not?

And I do not find into() to be an issue in practice.


How do you decide what externally available packages to store/cache in artifactory?

I’m curious, as I also deal with this tension. What (human and automated) processes do you have for the following scenarios?

1. Application developer wants to test (locally or in a development environment) and then use a net new third party package in their application at runtime.

2. Application developer wants to bump the version used of an existing application dependency.

3. Application developer wants to experiment with a large list of several third party dependencies in their application CI system (e.g. build tools) or a pre-production environment. The experimentation may or may not yield a smaller set of packages that they want to permanently incorporate into the application or CI system.

How, if at all, do you go about giving developers access via jfrog to the packages they need for those scenarios? Is it as simple as “you can pull anything you want, so long as X-ray scans it”, or is there some other process needed to get a package mirrored for developer use?


Where I am, every package repo - docker, pypi, rpm, deb, npm, and more - goes through artifactory and is scanned. Packages are auto-pulled into artifactory when a user requests them and are scanned by xray. Artifactory has a remote pull-through process that downloads once from the remote, and then never again unless you nuke the content. Vulnerable packages must have exceptions made in order to get used. Sadly, we put the burden of allowances on the person requesting the package, but it at least makes them stop and think before they approve it. Granting access to new external repos is easy, and we make requesting them pain-free, just making sure that we enable xray. Artifactory also supports local repos so users can upload their packages and pull them down later.


> Even if you decide to check-in the generated code for visibility.

I prefer checking it in by default (generated and checked in by users; CI failing if re-generation during the build produces a diff).

It enables much simpler debugging collaboration ("do you have diff in the gen/ dir when you try to repro this bug? I don't."), mistaken-checkin prevention ("did you accidentally run protoc on the wrong version before adding files to this commit? CI's failing because it sees changes in gen/ without changes to requirements or .proto files") and easier verification of upgrades just like this one.

Since the ability to use .gitattributes to suppress the visibility of generated diffs by default is widely supported (if not well standardized) across Git repo management platforms, the drawbacks of checking in generated sources are minimal.
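A minimal sketch of the "CI fails if re-generation produces a diff" check described above, assuming generated output lives in gen/ and a hypothetical make target wraps protoc:

    import subprocess
    import sys

    def check_generated_code_is_current():
        # Re-run the generator, then fail if anything under gen/ changed.
        subprocess.run(["make", "generate"], check=True)  # hypothetical target wrapping protoc
        status = subprocess.run(
            ["git", "status", "--porcelain", "--", "gen/"],
            check=True, capture_output=True, text=True,
        ).stdout
        if status.strip():
            print("gen/ is stale; re-run the generator and commit the result:")
            print(status)
            sys.exit(1)

    if __name__ == "__main__":
        check_generated_code_is_current()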

