Materialize is a cloud operational data store (ODS) enabling fast, incremental, and consistent transformations of live data using SQL.
We’re hiring Staff Software Engineers to build high-performance, distributed systems in Rust, focusing on efficient data ingestion and delivery. You’ll work on data integration pipelines, leveraging cutting-edge concepts like differential and timely dataflow to ensure correctness and efficiency.
What we’re looking for:
- 5+ years of systems-level engineering (Rust, Go, C/C++, or similar; Rust experience a plus).
Materialize is a cloud operational data store (ODS) enabling fast, incremental, and consistent transformations of live data using SQL.
We’re hiring Staff Software Engineers to build high-performance, distributed systems in Rust, focusing on efficient data ingestion and delivery. You’ll work on data integration pipelines, leveraging cutting-edge concepts like differential and timely dataflow to ensure correctness and efficiency.
What we’re looking for:
- 5+ years of systems-level engineering (Rust, Go, C/C++, or similar; Rust experience a plus).
> Sure, we’re not yet injecting storage faults, but then formal proofs for protocols like Raft and Paxos assume that disks are perfect, and depend on this for correctness? After all, you can always run your database over RAID, right? Right?
> If your distributed database was designed before 2018, you probably couldn’t have done much. The research didn’t exist.
I'm trying to understand this part but something seems off. It seems to imply that those proofs do not apply to the real world because disks are not perfect, but neither is RAM. The hardware for both RAM and disks has an inherent error rate which can be brought down to arbitrary levels by using error correction codes (e.g ECC RAM).
I'm assuming that the TigerBeetle correctness proofs are predicated on perfect RAM even though in reality there is a small probability of errors. This tells me that there is an error rate which they consider negligible. If that's the case what is the difference between:
* TigerBeetle's storage approach
* Paxos or Raft with enough error correction on disk writes that the probability of errors equals that of ECC RAM which is considered negligible
I've probably made a logical error in my reasoning but I can't see it. Can someone enlighten me?
Not an expert in this area, but I think disks have correlated failure modes whereas CPUs and memory generally don't.
Especially spinning platter disks, not sure about SSDs.
The difference in failure rates could be orders of magnitude ... Memory will have random bit flips but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event)
But disks will have non-random manufacturing issues. I'd be interested in more info too, but my impression is that the data on these issues is pretty thin. Foundation DB mentioned it ~10 years ago and Google has published data >10 years ago, but hardware has changed a lot since then
Software redundancy will take care of non-correlated failures, but it fails precisely when there are correlated ones
I work on a large distributed database system. SSDs absolutely have correlated failures. Also CMOS batteries. Also CPU and memory (think a manufacturing defect or a storage climate issue on specific batches that made it through QA). Pretty much nothing is 100% guaranteed to have no correlated failures. It comes down to probabilities. You can add flexibility, variation, vendor/sourcing diversity to reduce risks.
Went through a rather large batch of OCZ SSDs that all failed within a two week window years ago. Thankfully the IBM Death Star had long before made me allergic to putting devices of the same model in the same RAID array if I can help it, so it was a nuisance rather than a disaster.
SSDs tend to have highly correlated failure modes because you either run into a bug in the firmware which is the same on every SSD or you have the same wear on every SSD, which locks both into read-only mode within a short period of time. You might argue that read-only is not a failure, but read-only means downtime and replacing hardware.
> Memory will have random bit flips but I think they are pretty randomly distributed (or maybe catastrophic if there is some cosmic event)
Cosmic rays cause random bit flips, sure.
However, DIMMs going bad results in the same areas corrupting over and over. If you've run memtest or such with bad DIMMs, you'll see it telling you exactly which DIMM is bad, etc.
Now, dynamic memory management and virtual memory mapped onto physical memory complicate that picture... but you could easily end up with a single buffer used for e.g. TCP receive that lives in the same physical RAM region for the lifetime of the process.
Similarly, firmware bugs have resulted in very deterministic corruptions.
Thanks for the question! Joran from TigerBeetle here.
The research in question is the 2018 paper from UW-Madison, “Protocol-Aware Recovery for Consensus-Based Storage” (PAR) [0] by Ram Alagappan, Aishwarya Ganesan, as well as Remzi and Andrea Arpaci-Dusseau (who you may recognize as authors of OSTEP).
PAR won best paper at FAST '18 for showing that a single disk sector fault, in the write-ahead log (WAL) of a single replica, could propagate through the distributed RAFT or MultiPaxos consensus protocol, to cause global cluster data loss.
This was counter-intuitive at the time, because PAR showed that the redundancy of these consensus and replication protocols did not in fact always imply fault-tolerance, as had previously been assumed.
The reason is, and we cover this in depth in our recent QCon London talk [1], but it was assumed that checksums alone would be sufficient to detect and recover from storage faults.
However, while checksums can be used under the “Crash Consistency Model” to solve consistency through power loss, PAR showed that checksums are not sufficient to be able to distinguish between a torn write at the end of the (uncommitted) WAL caused by power loss, and a torn write in the middle of the (committed) WAL caused by bitrot.
What you tend to find is that the WALs for many of these protocols will truncate the WAL at the first sign of a checksum mismatch, conflating the mismatch with power loss when it might be bitort, and thereby truncating committed transactions, and undermining quorum votes in the Raft or MultiPaxos implementations.
RAID solutions don't always help here, either. See "Parity Lost and Parity Regained" [2] for more details. ZRAID is better here, and ZFS is a huge inspiration, but with local redundancy under ZFS you're still not leveraging the global redundancy of the consensus protocol as well as you could be.
To summarize PAR:
There are fundamental design changes to both the global consensus protocol and the local storage engine that would need to be made, if the storage fault model of PAR (and TigerBeetle) is to be solved correctly.
Furthermore, few simulators even test for these kinds of storage faults. For example, misdirected I/O, where the disk writes or reads to or from the wrong location of disk, which may yet have a valid checksum.
However, this is important, because disks fail in the real world. A single disk has on the order of a 0.5-1% chance of corruption in a 2 year period [3]. For example, a 5 node cluster has a 2.5-5% chance of a single disk sector fault, which again in terms of PAR can lead to global cluster data loss.
On the other hand, memory (or even CPU) faults, assuming ECC are not in the same order of magnitude probability, and therefore TigerBeetle's memory fault model is to require ECC memory.
But, again, to be crystal clear, checksums alone are not sufficient to solve the consensus corruption issue. The fix requires protocol changes at the design level, for the consensus protocol to be made storage fault-aware.
[1] “A New Era for Database Design” (we also dive into the research surrounding Fsyncgate, looking into the latent correctness issues that remain) https://www.youtube.com/watch?v=_jfOk4L7CiY
> However, while checksums can be used under the “Crash Consistency Model” to solve consistency through power loss, PAR showed that checksums are not sufficient to be able to distinguish between a torn write at the end of the (uncommitted) WAL caused by power loss, and a torn write in the middle of the (committed) WAL caused by bitrot.
The PAR paper states that "although Crash preserves safety, it suffers from severe unavailability". I assume that when TigerBeetle loads state from RAM into a CPU cache/register it operates under the NoDetection consistency model or the Crash consistency model if ECC RAM automatically resets the CPU on read errors. At the same time it doesn't suffer from severe unavailability so what gives?
The answer is probably that ECC RAM is just reliable enough that the NoDetection/Crash models are fine in practice.
I can believe that off-the-shelf checksum and redundancy options offered by filesystems like ext4 and ZFS or systems like RAID don't hit the required error probabilities but why does the argument stop there? Couldn't a distributed database generate error correcting data on every write in the application layer so that the probability becomes low enough such that NoDetection/Crash become a non-issue for storage, just like RAM? Is there some other fundamental difference between reading and write data from RAM versus a disk?
The crux of the problem: How do you solve misdirected read/write I/O? Where the firmware writes/reads to/from the wrong disk sector (but with a valid checksum)?
PAR shows how both global consensus protocol and local storage engine need to be modified for this, with foundational design changes at the protocol-level, if a distributed system is to not only preserve correctness, but also optimize for high availability.
Bear in mind that PAR is not only actually correct, but it's also more efficient than simply dialing up local redundancy, because it lets you recover from the global redundancy that you have via replication in the consensus protocol.
The paper is great, but will especially reward a few passes of reading. The examples they give take time, but are great to work through slowly to gain a deeper understanding.
And/or, you can read the Zig code of PAR in TB! :)
> The crux of the problem: How do you solve misdirected read/write I/O? Where the firmware writes/reads to/from the wrong disk sector (but with a valid checksum)?
Can't you make the expected location of the data part of the checksum?
Concretely,
- switch from checksums to hashes
- use something like Blake3 as keyed hash with the WAL offset as key.
Now, you can't accidentally read WAL block #5 instead of #7, as it's recorded hash won't match H(data, key=7).
Similar more old school technique: storing the expected role & id of a block inside the block can make storage more robust.
> Can't you make the expected location of the data part of the checksum?
Yes, and in fact we do this already in TigerBeetle (specifically towards solving misdirected I/O, along with hash chaining). Coincidentally, we used to use Blake3 but have since moved to AEGIS for hardware acceleration.
However, and this begins to hint at the problem, but redundancy alone is not sufficient. For misdirected I/O, we are already encoding more into the checksum...
And, PAR goes beyond this. For example, how do you disentangle corruption in the middle of the committed write ahead log, from a torn write at the end of the WAL due to power loss? For this, to solve this correctly (to decide whether to repair a committed operation or truncate an uncommitted operation respectively, for correctness and high availability), you really do need two WALs... and integration with (or awareness of) the invariants of the global consensus protocol—as the paper motivates.
you just bury these and play games with p. just like distributed consensus, there is no perfect storage medium. 1 bit flip a week too much for you? add secded. secded on memory. interleaving to spread out correlated errors. etc.
at some point you run in the probability of the earth being coincident with the sun and you call it good.
none of viewstamp replication, paxos, or raft deal with storage errors.
where it does get interesting is that in the low p spectrum can you can subvert the correctness of these protocols by fuzzing them. and then you get into byzantine protocols.
We tried to emphasize “Protocol-Aware Recovery for Consensus-Based Storage” in the blog post, because it's how TigerBeetle solves the storage fault model, and because PAR shows how you can actually solve this using the redundancy you already have in the global consensus protocol.
Note however broader Const Generics, allowing you to use your enumerated Hat type as a constant parameter to a type, producing PigInAHat<Hat::StrawBoater> as a distinct type from PigInAHat<Hat::Top> are not yet in stable Rust and aren't on the near horizon.
The idea is that any types which Rust can see for itself can be trivially compared for equality will eventually qualify for this purpose, so your Hat enumeration, or a structure with four integers and a boolean in it, or anything that Rust says "Yeah, I'll just memcmp that for equality" would qualify. No floating point numbers, no strings, nothing actually crazy. But this is in some undefined future, for now we just have min_const_generics which is basically the fundamental integral types as constants in generics.
I must be missing something. The paper describes the algorithm of the CRDT and mentions that "timestamps ’t need to be globally unique and totally ordered".
Then it mentions multiple times that Lamport clocks/timestamp can be used as the timestamp in their system but as far as I know these only give a partial order of events. How is this reconciled in their system?
My understanding[1] is that you would not use only a Lamport timestamp but rather a tuple (Lamport, actor ID), where the actor ID is a globally unique per-node ID (say, a UUID), which would then be used as a tiebreaker in cases where comparing by the Lamport timestamp alone would give an ambiguous ordering.
This should not be problematic, since the Lamport timestamps are only equal in cases where there is guaranteed to be no causal relationship between the events (i.e. the user did not see the effects of one event before submitting the next event), so it's fine to pick the ordering arbitrarily.
Every partially ordered set S can easily be "upgraded" to a totally ordered set by simply "extending" every element of of S with an element of another set A, where A has a total ordering. The ordering on elements of A is used to break any ties when ordering the elements of S.
So for a Lamport timestamp, you go from the 1-tuple (timestamp) to the 2-tuple (timestamp, id) where 'id' is some arbitrary ordered key, unique for each process in the system. For instance every process in the system could just generate a large GUID, or a really big integer, or anything else. Then for any event e1 and e2, if the timestamps are equal, just tiebreak based on the ordering of IDs.
(This ability to "upgrade" a Lamport timestamp to have a total ordering is actually covered in Lamport's original paper at some point IIRC, but I don't have it open.)
For other HN users (and future reference) the short film is titled "Room 8" by James W Griffiths [0], it won the BAFTA 2014 award for short film [1] and is based on a short script written by Oscar winning writer Geoffrey Fletcher.
I'd be interested to hear about the failure modes of I2C that you've observed. Unfortunately I'm not familiar with CAN or SPI. Do these alternatives provide error correction as part of the protocol? Do they offer superior reliability in some other way?
1) Errant clock pulses--either through reflection or ground bounce since your system isn't differential--which put your system into an unknown state since every clock pulse is relevant to transaction state. And the fact that the lines are merely pulled up rather than actively driven makes them more susceptible.
2) Hung buses--once the bus is hung, there is no guaranteed way to reset it. You have to rely on your slave I2C chip actually having a timeout reset.
3) Transactions often complete but can ACK wrong. This is bad if you're doing something that isn't idempotent.
4) Nobody ever gets an I2C module completely correct. I2C has a bunch of little corners (Can it send just an address byte? Does it really get clock stretching correct? Does it send the correct events to the CPU on restarts?) and everybody always seems to miss at least one of them.
SPI: Not great, but the extra control lines and the fact that nothing is bidirectional is an asset for reliability.
The primary advantage in reliable systems is that SPI has a CS/SS line that serves as a reset. Even if your clock bounces or a slave chip gets confused, you can often detect it and drop the CS/SS before you complete the requisite number of SCLK cycles and prevent the transaction from completing. Also, dropping the CS/SS almost always frees the SDI/MISO line even if the chip goes haywire.
CAN: Specifically designed for harsh environments with voltage spikes, temperature fluctuations, RF interference, etc.
Fully differential so resistant to noise--some topologies can even survive a break in one of the lines. Retransmits are baked into the hardware. Error correction is baked into the protocol. System does baud-rate adjustment on the fly so it handles frequency drift.
The downsides are generally more complexity (although that is buried in silicon), external transceivers and normally more current consumption during operation.
SPI is similar to I2C in concept, but there's separate RX and TX data lines and a discrete "chip select" signal for addressing. The chips drive both high and low rather than relying on a pull-up, so SPI can operate significantly faster than I2C.
CAN allows many nodes to communicate via broadcast messages on a shared twisted pair of wires called a bus. The signals are differential, making them fairly immune to noise. The messages are CRC'd and acked, and are automatically retransmitted as necessary. CAN is "content addressed" rather than addressed-to-recipient, and each message's address also acts as a priority for arbitration of the bus; the highest priority message ready to be written always gets the next opportunity to transmit. To make things annoying, messages can only contain 8 bytes of payload.
I2C, SPI and CAN all have different purposes. I2C is nice for interfacing to low data rate chips because of the low signal count. SPI is nice for interfacing to high data rate chips but is 4 signals at least. CAN is nice for connecting microcontrollers that are far apart from each other.
Folks abuse I2C because it's only two signals, and run it over connectors. The protocols built on top of I2C quickly fall apart when the signals are flaky, leading to firmware pain. SPI is harder to abuse because no one wants to run 4 comms signals over connectors, and typically at several MHz, electrical engineers know it's a bad idea. CAN is very robust when done correctly (good enough for drive by wire in cars), but it's also orders of magnitude more complicated.
The typical way to add CAN to a raspberry pi is an MCP2515 chip. The linux driver for the MCP2515 can barely service the chip fast enough though, which means it tends to drop data at the higher bitrates.
> The typical way to add CAN to a raspberry pi is an MCP2515 chip. The linux driver for the MCP2515 can barely service the chip fast enough though, which means it tends to drop data at the higher bitrates.
This isn't always the fault of the driver. The MCP2515 is just a kind of crappy controller and has lots of bugs in spite of how much it's used. Even on a Beaglebone (which doesn't struggle to keep up), the MCP2515s are somewhat suspect.
The Beaglebones have a CAN controller on board, so all you need to do is hook up a CAN transceiver and you're good to go. (Being able to use Wireshark to debug CAN is really quite a nice change over most nasty CAN tools ...) And because the controller is part of the silicon, it's really fast.
I didn't mean to imply that the linux driver is of poor quality. It's just a matter of the kernel's inability to schedule "soft interrupt" handlers with low enough jitter to guarantee the hardware is serviced fast enough. The MCP2515 certainly doesn't make it easy as it only has two RX mailboxes without FIFO semantics. It's kind of funny because any old 8-bit MCU has plenty of oomph to reliably drive an MCP2515 but rediculously powerful 64-bit SoC boards can't handle it.
balena founder here. balenaOS does use the hardware watchdog available on the raspberry pi to detect CPU lock ups and automatically reset the board. On top of that we're also running software watchdogs that check the health of key system components and restart them if they become unhealthy.
It's true that SD cards are known for getting corrupt. balenaOS separates the partitions of the device and keeps the userspace in a readonly one while keeping mutable OS state in a separate partition. We are very conscious of writing as infrequently as possible to the SD card for this reason. The partition that accepts the most writes is the one holding the user container, which will get written to during an update, and also in case the user container stores any data on the device.
I'm aware that SD cards will internally swap blocks and don't really care about partition boundaries but assuming you're using an SD card with a well designed firmware it shouldn't lose a block during wear leveling.
That said, the SD card problem is one reasons we designed balenaFin :)
I've personally seen many dozens of SD cards fail such that they don't even know their own sector count anymore. My suspicion is that in-progress wear leveling operations aren't rebust to sudden power loss (ejecting card without unmounting it). Most likely firmware specific...
Yes, and this problem is exacerbated by the raspberrypi's microUSB power input. We've seen countless number of cases where a Pi exhibiting SD card corruption was also experiencing undervoltage. If you're using raspberrypis in production it's paramount to have a good microUSB cable and power supply
Materialize is a cloud operational data store (ODS) enabling fast, incremental, and consistent transformations of live data using SQL.
We’re hiring Staff Software Engineers to build high-performance, distributed systems in Rust, focusing on efficient data ingestion and delivery. You’ll work on data integration pipelines, leveraging cutting-edge concepts like differential and timely dataflow to ensure correctness and efficiency.
What we’re looking for:
- 5+ years of systems-level engineering (Rust, Go, C/C++, or similar; Rust experience a plus).
- Strong CS fundamentals (B.S./M.S. equivalent).
- Bonus: distributed systems, SQL databases, CDC, or stream processing experience.
Compensation: $200K–$225K + equity.
Backed by Kleiner Perkins and Lightspeed ($100M+ raised), we’re remote-friendly and committed to building a diverse, world-class team.
reply