I expected this to happen, but tbh I didn't expect it to take this long.
They went for a fundamentally "lock in to a startup DB that's fully proprietary, so you're screwed if we go under" model, and that's now happening.
Open source _helps a bit_, but it doesn't change that customers now have to self-manage a database. I'm guessing most of their customers are not particularly savvy at DB self-hosting.
It's a single 4992 line HTML file, which is the source code, making it a lot harder to contribute to or maintain. I'd like it a bit more (and I might even contribute) if it had at least some separation of concerns and a simple build system, even if that build system is just a bash command concatenating source files into one HTML file (if having a single HTML file is somehow very important, which it isn't).
I know it's controversial to do things this way, but given that it's a personal project and focuses on portability, I think the benefits are worthwhile.
I have taken care to break things up, comment, etc - if you start at main(), you may find it's not any more difficult to follow than if I'd broken it up into multiple files. Just because it's one file doesn't mean it doesn't separate concerns.
>"Contributions are very welcomed. Please feel free to submit proposals directly in the form of a PR or Issue."
Good luck getting contributions. I don't have the time to read and understand all 4000+ lines of this very long file. I have ideas, but I'm not even interested in participating because of the format.
The 600ns figure represents our optimized write path, not a full fsync operation. We achieve it, among other things, through:
1- As mentioned, we're not using a traditional filesystem and we bypass several VFS layers.
2- Free space management combines two RB trees, giving O(log n) slicing and O(log n + k) merging, where k is the number of adjacent free extents (see the sketch after this comment).
3- The majority of the write path employs a lock-free design, and where needed we use per-CPU write buffers.
The transactional guarantees we provide come via:
1- Atomic individual operations with retries
2- Various conflict resolution strategies (timestamp-based, etc.)
3- Durability through controlled persistence cycles with configurable commit intervals
Depending on the plan, we provide a persistence guarantee of between 30 seconds and 5 minutes.
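As a rough illustration of the two-tree idea mentioned in point 2 above (not HPKV's actual code, which isn't public): one tree ordered by offset is used to find and merge adjacent free extents, and one ordered by size is used for best-fit allocation. std::map/std::multimap are red-black trees in common C++ standard library implementations, so this is a minimal sketch of the scheme:

    // Illustrative sketch only. by_offset: extent start -> length (for merging
    // neighbours); by_size: length -> extent start (for O(log n) best-fit alloc).
    #include <cstdint>
    #include <iostream>
    #include <iterator>
    #include <map>
    #include <optional>

    struct FreeSpace {
        std::map<uint64_t, uint64_t> by_offset;     // offset -> len
        std::multimap<uint64_t, uint64_t> by_size;  // len -> offset

        void insert(uint64_t off, uint64_t len) {
            // Merge with the previous extent if it ends exactly at `off`.
            auto next = by_offset.lower_bound(off);
            if (next != by_offset.begin()) {
                auto prev = std::prev(next);
                if (prev->first + prev->second == off) {
                    off = prev->first;
                    len += prev->second;
                    erase_size(prev->first, prev->second);
                    by_offset.erase(prev);
                }
            }
            // Merge with the following extent if it starts exactly at `off + len`.
            if (next != by_offset.end() && off + len == next->first) {
                len += next->second;
                erase_size(next->first, next->second);
                by_offset.erase(next);
            }
            by_offset[off] = len;
            by_size.emplace(len, off);
        }

        // Best-fit allocation: smallest free extent that can hold `len`.
        std::optional<uint64_t> alloc(uint64_t len) {
            auto it = by_size.lower_bound(len);
            if (it == by_size.end()) return std::nullopt;
            uint64_t off = it->second, extent = it->first;
            by_size.erase(it);
            by_offset.erase(off);
            if (extent > len) insert(off + len, extent - len);  // keep the tail free
            return off;
        }

    private:
        void erase_size(uint64_t off, uint64_t len) {
            auto range = by_size.equal_range(len);
            for (auto it = range.first; it != range.second; ++it)
                if (it->second == off) { by_size.erase(it); return; }
        }
    };

    int main() {
        FreeSpace fs;
        fs.insert(0, 4096);
        auto a = fs.alloc(100);   // carve 100 bytes out of the extent
        fs.insert(*a, 100);       // freeing it merges back into one extent
        std::cout << "extents: " << fs.by_offset.size() << "\n";  // prints 1
    }

Both maps stay in sync, so a free costs two O(log n) updates plus the k adjacent-extent merges, which matches the complexity claimed above.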
I didn't necessarily mean exactly fsync. I guess I'll ask: Is it actually flushed to persistent disk in 600ns such that if the node crashes, the data can always be read again? Or does that not fully flush?
I have a product to sell you with a postgres interface but p99 write latency of 100 nanoseconds. It's postgres but our driver says "write done" before a write completes. It's revolutionary!
A memory copy plus updating whatever internal memory structures you have is definitely going to be over 1us. Even a non-fsync NVMe write is still >=1us, so this is grossly misleading.
Our p50 is indeed 600ns for writes, the way I explained it. I understand that at this point this can be read as a "trust me bro" kind of statement, but I can offer you something: we can have a quick call and I'll provide you access to a temp server with HPKV installed on it, along with our test suite, and you'll have a chance to run your own tests.
This could be a good learning opportunity for both of us (potentially more for us) :)
If you're interested, please send us an email at support@hpkv.io and we can arrange that.
The question from most of us isn't "did you get that number," it's "what does that number actually mean?" Writes don't need to return any data, so you can sort of set that latency number arbitrarily by changing the meaning of "write done." I can make "redis with 0 write latency" by returning a "write done" immediately after the packet lands, but then the meaning of "write done" is effectively nil.
In every persistent database, that number indicates that an entry was written to a persistent write-ahead log and that the written value will stay around if the machine crashes immediately after the write. Clearly you don't do this because it's impossible to do in 600 ns. For most of the non-persistent databases (eg redis, memcached), write latency is about how long it takes for something to enter the main data structure and become globally readable. Usually, "write done" also means that the key is globally readable with no extra performance cost (ie it was not just dumped into a write-ahead log in memory and then returned).
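To make that distinction concrete, here is a minimal sketch (assuming a POSIX filesystem API; the actual numbers depend entirely on the hardware): a buffered write() returns as soon as the page cache is updated, while durability only arrives after fsync() or an equivalent WAL flush, and the two typically differ by orders of magnitude:

    // Rough illustration of why "write done" needs a definition. A buffered
    // write() returns once the page cache is updated; durability only comes
    // after fsync(). This is a sketch, not a rigorous benchmark.
    #include <fcntl.h>
    #include <unistd.h>
    #include <chrono>
    #include <cstdio>

    static double time_write(int fd, const char* buf, size_t len, bool do_fsync) {
        auto t0 = std::chrono::steady_clock::now();
        ssize_t n = write(fd, buf, len);
        if (do_fsync) fsync(fd);
        auto t1 = std::chrono::steady_clock::now();
        if (n != (ssize_t)len) perror("write");
        return std::chrono::duration<double, std::nano>(t1 - t0).count();
    }

    int main() {
        int fd = open("wal.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        char rec[100] = {0};  // ~100-byte record, roughly like the benchmark's values
        std::printf("buffered write: %8.0f ns\n", time_write(fd, rec, sizeof rec, false));
        std::printf("write + fsync:  %8.0f ns\n", time_write(fd, rec, sizeof rec, true));
        close(fd);
    }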
In a world where you spoke about the product more credibly or where the code was open source, I might accept that this was the case. As it stands, it looks like:
1. This was your "marketing gimmick" number that you are trying to sell (every database that isn't postgres has one).
2. You got it primarily by compromising on the meaning of "write done," and not on the basis of good engineering.
To clarify what our numbers actually mean and address your main question of "what does that number actually mean":
1- The 600ns figure represents precisely what you described - an in-memory "write done" where memory structures are updated and the data becomes globally readable to all processes. This is indeed comparable to what Redis without persistence or memcached provides. Even at this comparable measurement basis (which isn't our marketing gimmick, but the same standard used by in-memory stores), we're still 2-6x faster than Redis depending on access patterns.
For full persistence guarantees, our mean latency increases to 2582ns per record (600ns in-memory operation + 1982ns disk commit) for our benchmark scenario with 1M records and 100-byte values. This represents the complete durability cycle. It should be compared with, for example, Redis with AOF enabled.
2- I agree that the meaning of "write done" requires clear context. We've been focusing on the in-memory performance advantages in our communications without adequately distinguishing between in-memory and persistence guarantees.
We weren't trying to hide the disk persistence number; we simply used "write done" because our comparison was against Redis without persistence. But mentioning persistence alongside it caused understandable confusion, and that was bad on our part.
Based on your feedback, we'll update our documentation to provide more precise metrics that clearly separate these operational phases and their respective guarantees.
UPDATE:
Clarification on the mean disk write measurement:
The mean value is calculated from the total time of flushing the whole write buffer (processed in parallel, depending on the number of available CPU cores) divided by the number of records. So the total time for processing and writing 1M records as described above was 1982ms, which makes the mean write time for each record 1982ns.
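Spelled out, that arithmetic is just a back-of-the-envelope restatement of the numbers already given above (not a new measurement), and it shows why this is a throughput-derived mean rather than a latency any single write observed:

    // 1,000,000 records flushed in 1982 ms total
    //   => a mean of 1982 ns of flush time attributed to each record.
    // Each individual flush still pays full disk latency; the mean is total/count.
    #include <cstdio>
    int main() {
        double total_ms = 1982.0, records = 1e6;
        double mean_ns = total_ms * 1e6 / records;           // ms -> ns, divided by count: 1982 ns
        double agg_rate = records / (total_ms / 1e3);         // ~504,000 records/s aggregate
        std::printf("mean per record: %.0f ns\n", mean_ns);
        std::printf("aggregate flush rate: %.0f rec/s\n", agg_rate);
    }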
> For full persistence guarantees, our mean latency increases to 2582ns per record (600ns in-memory operation + 1982ns disk commit)
By the way, this set of numbers also makes you look stupid, and you should consider redoing those measurements. No disk out there has less than 10 microseconds of write latency, and the ones in the cloud are closer to 50 us. Citing 2 micros here makes your 600 ns number also look 10x too optimistic.
I would suggest taking this whole thread as less of an opportunity to do marketing "damage control" and more of an opportunity to get honest feedback about your engineering and measurement practices. From the outside, they don't look good.
I also see the update in response to this comment, and it puts everything into perspective. You haven't changed the meaning of "write done," you have just been comparing your reciprocal throughput against Redis's latency, and I think you have been confusing those two.
"600 ns" then really means "1.6M QPS of throughput," which is a good number but is well within the capabilities of many similar offerings (including several databases that are truly persistent). It also says nothing about your latency. If you want to say you are 2-6x faster than Redis, you are going to have to compare that number to Redis's throughput.
Reading your comment about comparing the throughput to Redis, it seems to me that you haven't really read the benchmark article. In there, we are in fact comparing throughput, not latency. Allow me to quote some of the throughput numbers from the article mentioned above:
Single Operation Performance
Redis Single Operations
SET: 273,672.69 requests per second (p50=0.095 ms)
GET: 278,164.12 requests per second (p50=0.095 ms)
HPKV Single Operations
INSERT: 1,082,578.12 operations per second
GET: 1,728,939.43 operations per second
DELETE: 935,846.09 operations per second
Batch Operation Performance
Redis Batch Operations
SET: 2,439,024.50 requests per second (p50=0.263 ms)
GET: 2,932,551.50 requests per second (p50=0.223 ms)
HPKV Batch Operations
INSERT: 6,125,538.03 operations per second
GET: 8,273,300.27 operations per second
DELETE: 5,705,816.00 operations per second
The latency of 600ns, as I mentioned, is for a local vectored interface call, not over the network. That is not how we compared the system with Redis. The numbers above use our RIOC API over the network, in which HPKV behaves like a server, similar to a Redis server.
The numbers above are compared with Redis in-memory, and HPKV is still 2-6x faster, even if you treat HPKV as just an in-memory KV store with no persistence.
It's been shown many times that in well-connected networks (e.g. datacenters) H2 is faster, often because everything H3 improves on is now handled in user space, and that overhead negates the gains.
The benefits only show in poorly connected networks (public internet), so that's pretty exclusively where it should be used - anything internet-facing.
There's ongoing work exploring QUIC-in-kernel-space at https://github.com/lxin/quic, and more generally HTTP/3 will be increasingly optimized over time as it moves towards becoming the majority of HTTP traffic (a few years off, but looks likely eventually). There's no fundamental reason I'm aware of that HTTP/3 would be _inevitably_ slower than HTTP/2; for now it seems likely that it's largely down to implementation details.
There are plenty of internet-facing cases with average-at-best connectivity where HTTP/3 would be beneficial today but isn't available (non-megacorp Android apps, CLI tools, IoT, desktop apps, etc.). Even on the backend, it's very common to have connections between datacenters with significant latency (e.g. distributed CDN to central application server/database).
The SQLite file layout is designed to hedge against HDD fragmentation. It wouldn't benefit as much from NVMe alone as it would from moving to a more modern, SSD-native layout and then using NVMe.