I don’t get it. How could you be fsyncing the WAL in 600ns? What are the transac...

mehrant · 2025-03-17T15:48:21 1742226501

that's a great question. the 600ns figure represents our optimized write path and not a full fsync operation. we achieve it, among other things, through:

1- as mentioned, we are not using any traditional filesystem and we're bypassing several VFS layers.

2- free space management is a combination of two RB trees, providing O(log n) for slice and O(log n + k) for merge, where k is the number of adjacent free spaces.

3- the majority of the write path employs a lock-free design and, where needed, we're using per-CPU write buffers
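to make point 2 concrete, here is a minimal sketch of a free-space manager that keeps the same extents in two ordered indexes: one by size (to slice out an allocation) and one by offset (to merge a freed extent with its neighbours). this is an illustration under assumptions, not HPKV's actual code: the `FreeSpace` class and its method names are hypothetical, and sorted Python lists with `bisect` stand in for the two RB trees, so lookups are O(log n) but inserts and removals are O(n) here, unlike in a real balanced tree.

```python
import bisect


class FreeSpace:
    """Free extents indexed two ways (hypothetical sketch, not HPKV's code).

    Sorted lists stand in for the two RB trees: lookups are O(log n),
    but list insert/pop is O(n), which a real balanced tree avoids.
    """

    def __init__(self, total):
        self.by_offset = [(0, total)]  # sorted (offset, length), used for merging
        self.by_size = [(total, 0)]    # sorted (length, offset), used for slicing

    def _add(self, off, length):
        bisect.insort(self.by_offset, (off, length))
        bisect.insort(self.by_size, (length, off))

    def _remove(self, off, length):
        self.by_offset.pop(bisect.bisect_left(self.by_offset, (off, length)))
        self.by_size.pop(bisect.bisect_left(self.by_size, (length, off)))

    def alloc(self, length):
        """Slice: take the smallest free extent that fits; return its offset or None."""
        i = bisect.bisect_left(self.by_size, (length, -1))
        if i == len(self.by_size):
            return None
        ext_len, off = self.by_size[i]
        self._remove(off, ext_len)
        if ext_len > length:
            # Put the unused tail of the extent back into both indexes.
            self._add(off + length, ext_len - length)
        return off

    def free(self, off, length):
        """Merge: coalesce the freed extent with adjacent free neighbours."""
        i = bisect.bisect_left(self.by_offset, (off, 0))
        # Right neighbour starts exactly where the freed extent ends.
        if i < len(self.by_offset) and self.by_offset[i][0] == off + length:
            r_off, r_len = self.by_offset[i]
            self._remove(r_off, r_len)
            length += r_len
        # Left neighbour ends exactly where the freed extent starts.
        if i > 0:
            l_off, l_len = self.by_offset[i - 1]
            if l_off + l_len == off:
                self._remove(l_off, l_len)
                off, length = l_off, length + l_len
        self._add(off, length)


fs = FreeSpace(100)
a = fs.alloc(10)
b = fs.alloc(20)
c = fs.alloc(30)
fs.free(b, 20)
fs.free(a, 10)
fs.free(c, 30)
```

after the three frees, the extents have coalesced back into a single free region covering the whole space. this sketch only ever merges the two immediate neighbours.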

the transactional guarantees we provide are via:

1- atomic individual operations with retries

2- various conflict resolution strategies (timestamp, etc.)

3- durability through controlled persistence cycles with configurable commit intervals

depending on the plan, the persistence guarantee is between 30 seconds and 5 minutes

buenzlikoder · 2025-03-17T16:03:18 1742227398

What storage backend are you using?

A write operation on an SSD takes tens of µs, even without any VFS layers

mehrant · 2025-03-17T16:17:01 1742228221

sorry for not being clear again. by saying this number does not represent a full fsync operation, I meant it doesn't include the SSD write time. this is the time to update the KV's internal memory structure plus adding to the write buffers.

this is fair because we provide transactional guarantees and immediate consistency, regardless of the state of the append-only write buffer entry. at that speed, for a given key, the value might change and a new write buffer entry might be added for that key before the earlier buffered write has had the chance to complete (as you mentioned, the actual write to disk is slower), but conflict resolution still ensures that the last valid entry is written and the rest are skipped. before that write completes, HPKV is acting like an in-memory KV store.
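a rough sketch of that behaviour, as an illustration only: writes update an in-memory map and append to a buffer, reads are served from memory, and a later flush persists only the newest entry per key while skipping superseded ones. everything here is an assumption made for the example (the `BufferedKV` class, the sequence-number scheme, and a dict standing in for the disk); it is not HPKV's implementation, and it ignores concurrency entirely.

```python
class BufferedKV:
    """In-memory KV with an append-only write buffer (hypothetical sketch)."""

    def __init__(self):
        self.mem = {}     # key -> (seq, value); all reads are served from here
        self.buffer = []  # append-only (seq, key, value) entries awaiting persistence
        self.seq = 0      # monotonically increasing write counter
        self.disk = {}    # stand-in for the slow persistent medium

    def put(self, key, value):
        # Fast path: update memory and append to the buffer, no device I/O.
        self.seq += 1
        self.mem[key] = (self.seq, value)
        self.buffer.append((self.seq, key, value))

    def get(self, key):
        entry = self.mem.get(key)
        return entry[1] if entry else None

    def flush(self):
        """Persist pending entries; return how many were skipped as superseded."""
        pending, self.buffer = self.buffer, []
        skipped = 0
        for seq, key, value in pending:
            if self.mem[key][0] != seq:
                # A newer write to the same key exists, so this entry is stale.
                skipped += 1
                continue
            self.disk[key] = value
        return skipped


kv = BufferedKV()
kv.put("a", 1)
kv.put("a", 2)
kv.put("b", 3)
visible = kv.get("a")        # newest value is visible before any flush
unflushed = dict(kv.disk)    # nothing has reached the "disk" yet
skipped = kv.flush()         # the stale entry for "a" is skipped
```

note what the sketch also makes visible: anything still sitting in `buffer` when the process dies is lost, which is why the fast-path latency says nothing about durability.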

addaon · 2025-03-18T13:50:25 1742305825

You’re getting a lot of crap (rightly) for your lack of clarity and fuzzy language use on this point…

But that also points out the demand for the seemingly-unachievable promises you’re making. I wonder if it’s worth stirring up some out-of-production DIMM-connected Optane and using that as a basis for a truly fast-persisted append-only log. If that gives you the ability to achieve something that’s really in demand, you can go from there to a production basis, even if it’s just a stack of MRAM on a PCIe card or something until the tech (re-)arises.

UltraSane · 2025-03-19T06:17:12 1742365032

you can just use NVDIMMs, which are generally 8, 16, or 32GB DIMM modules that have enough flash and backup power to copy all data to the flash storage if power is lost on the host.

https://www.micron.com/content/dam/micron/global/public/prod...