Author here. I'm happy to answer any questions although this project was from 10+ years ago so I could be a little rusty.
Over the years I've been trying to find better ways to do this kind of visualization but for other CS topics. Moving to video is the most realistic option but using something like After Effects takes A LOT of time and energy for long-form visualizations. It also doesn't produce a readable output file format that could be shared, diff'd, & tweaked.
I spent some time on a project recently to build out an SVG-based video generation tool that can use a sidecar file for defining animations. It's still a work in progress but hopefully I can get it to a place where making this style of visualizations isn't so time intensive.
I just want you to know how much this visualization was appreciated. In my time working at AWS, I recommended this website to every one of our new hires to learn how distributed consensus works. Know that this has taught probably 50+ people. Thank you for what you’ve built.
Thanks so much for letting me know! It's always hard to tell when I put something out there if it just gets lost in the ether. I'm glad to hear it helped so many folks.
"Suppose Alice has hashgraph A and Bob hash hashgraph B. These hashgraphs may be slightly different at any given moment, but they will always be consistent. Consistent means that if A and B both contain event x, then they will both contain exactly the same set of ancestors for x, and will both contain exactly the same set of edges between those ancestors."
Consider UTXO-based events. There can be an event E1 that consumes UTXO1 and UTXO2, and an event E2 that consumes UTXO2 and UTXO3. Hashgraphs that contain only one of these events are each consistent, but their union is not. This can be used to do some Byzantine things; I can think of at least two: a double-spend and degradation of service.
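To make that concrete, here's a minimal sketch (event and UTXO names are hypothetical, not from the paper) of two views that are each internally valid but whose union double-spends UTXO2:

```python
# Each view is internally valid (no UTXO spent twice within it),
# but the union of the two views spends UTXO2 twice.

def spent_utxos(events):
    """Return the list of UTXOs consumed by a set of events."""
    spent = []
    for _, inputs in events:
        spent.extend(inputs)
    return spent

def is_valid(events):
    """A view is valid if no UTXO is consumed more than once."""
    spent = spent_utxos(events)
    return len(spent) == len(set(spent))

E1 = ("E1", ["UTXO1", "UTXO2"])
E2 = ("E2", ["UTXO2", "UTXO3"])

alice_view = [E1]      # hashgraph A contains only E1
bob_view = [E2]        # hashgraph B contains only E2
merged = [E1, E2]      # the union of the two views

print(is_valid(alice_view))  # True
print(is_valid(bob_view))    # True
print(is_valid(merged))      # False -- UTXO2 is spent twice
```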
This paper is a clear example of how to make a thing that has no obvious problems.
I haven't read that paper but it seems like it's fixing a different problem of Byzantine fault tolerance. Most consensus systems that are internal for an organization don't have the Byzantine issue so it simplifies the problem.
This is wonderful. Can I ask how you created it? Stack used and source code? I'd love to create something like this to help visualize things I'm working with currently.
It's all done with d3 and JavaScript. The visualizations aren't deterministic so I ended up writing a shitty Raft implementation in JS. Overall it was a terrible approach because it was so time consuming but I made it work. You can find all the source code in this repo: https://github.com/benbjohnson/thesecretlivesofdata
LiteFS author here. I don't disagree with any points in the article but perhaps a reframing could help. I previously wrote a tool called Litestream that would do disaster recovery for a single-node SQLite server and I still think it's a great default option for people starting new projects. Unless you're doing very database-specific things, most SQL will carry over between SQLite and Postgres and MySQL, especially if you add ORMs in the mix. Pick the one that gets you writing code the fastest and you can switch down the road if you need it.
Rather than a paradigm shift or hype, I see distributed SQLite as an extension of a path that devs can go down. With Litestream, the most common complaint I got was that devs were worried that they couldn't horizontally scale with SQLite and they'd be stuck. While you probably won't hit vertical scaling limits of SQLite on most projects, it still caused concern. So LiteFS became a "next step" that a dev could take if they ever got to that point. It doesn't need to be your starting point.
As for the "hacky" solution of txid, I'm not sure why that's hacky. Your application isn't required to use it or the optional built-in proxy but it's available if it fits your application's needs. It also works for plugging legacy applications into distributed SQLite without retrofitting the code. The proposed solution of caching seems orthogonal to the discussion of distributed application data. I don't think any database provider would suggest to avoid caching when it's appropriate but there's plenty of downsides of caching. Hell, it's one of the two hardest problems in computer science.
>most SQL will carry over between SQLite and Postgres and MySQL, especially if you add ORMs in the mix
I think this goes underappreciated, or rather the opposite is overstated.
Sure there are some edge cases that don't work the same, but most apps won't hit those.
My _biggest_ gripe with SQLite so far is the lack of column reordering like other DBs. And my simplistic understanding is that the others do it exactly the same way as you'd do it manually with SQLite - table gets _replaced_ with an identical table with the data correctly ordered and the data is shoved into the new table.
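For reference, a minimal sketch of that manual rebuild in SQLite, using Python's bundled driver (table and column names are made up; a real migration also has to recreate indexes, triggers, and views, and mind foreign keys):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, nickname TEXT, email TEXT);
    INSERT INTO users (id, nickname, email) VALUES (1, 'ben', 'ben@example.com');

    -- "Reorder" by rebuilding: create a table with the desired column order,
    -- copy the rows across, then swap the tables.
    BEGIN;
    CREATE TABLE users_new (id INTEGER PRIMARY KEY, email TEXT, nickname TEXT);
    INSERT INTO users_new (id, email, nickname)
        SELECT id, email, nickname FROM users;
    DROP TABLE users;
    ALTER TABLE users_new RENAME TO users;
    COMMIT;
""")

print(con.execute("SELECT * FROM users").fetchall())
# [(1, 'ben@example.com', 'ben')]
```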
> Sure there are some edge cases that don't work the same, but most apps won't hit those.
That really depends on your modelling style. If you like things like types, SQL-side processing (e.g. using functions), or covering indexes, then you'll hit issues every five minutes in sqlite.
SQLite really wants the logic (including consistency logic) in the application, just compare the list of aggregate functions in postgres versus sqlite, or consider that you have to enable FKs on a per-connection basis.
Which I guess is why ORMs help a lot: they are generally built around application-side logic and a lowest-common-denominator set of database features.
I'm pretty sure SQLite has covering indexes. And the relatively new strict mode should enforce at least basic types (though if you want to enforce your own rules for things like dates you're still on your own).
I checked to be sure I had not missed it, and didn't find anything. You have expression indexes and partial (conditional) indexes, but no covering INCLUDE. Obviously you can kinda emulate it by adding the columns you want to cover to the key, but…
> though if you want to enforce your own rules for things like dates you're still on your own
That’s what I was talking about, having richer types, and the ability to create more (especially domains).
Strict tables provide the table stakes of actually enforcing the all-of-five types SQLite has built in. AFAIK a database-wide strict mode is something that's still being discussed, if it ever becomes reality.
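A small sketch of both points, strict typing and the per-connection foreign-key pragma, using Python's bundled SQLite (STRICT needs SQLite 3.37+; the schema is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Foreign keys are off by default and must be enabled per connection.
con.execute("PRAGMA foreign_keys = ON")

con.executescript("""
    CREATE TABLE teams (id INTEGER PRIMARY KEY) STRICT;
    CREATE TABLE players (
        id        INTEGER PRIMARY KEY,
        team_id   INTEGER NOT NULL REFERENCES teams(id),
        jersey_no INTEGER NOT NULL
    ) STRICT;
    INSERT INTO teams (id) VALUES (1);
""")

# A STRICT table rejects values that can't be stored as the declared type...
try:
    con.execute("INSERT INTO players (team_id, jersey_no) VALUES (1, 'abc')")
except sqlite3.Error as e:
    print("datatype error:", e)

# ...and the pragma is what makes the FK actually enforced on this connection.
try:
    con.execute("INSERT INTO players (team_id, jersey_no) VALUES (999, 7)")
except sqlite3.Error as e:
    print("fk error:", e)
```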
Ya, if my reading is correct this is the poor man's covering index: if all the requested data is in the index key the query will not hit the table, so you can add additional fields at the end of the key to get index-only scans (at a storage cost, plus some flexibility cost, e.g. it doesn't work with unique indexes).
I guess it's less of an issue in sqlite than in databases with richer datatypes in the sense that all datatypes are ordered and thus indexable.
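A quick sketch of that key-padding trick with Python's sqlite3 (schema is made up); the query plan should show the read being satisfied from the index alone:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE games (id INTEGER PRIMARY KEY, season INTEGER, home TEXT, score INTEGER);
    -- "Covering" by padding the key: home and score are appended purely so
    -- the query below never has to touch the table itself.
    CREATE INDEX games_by_season ON games (season, home, score);
""")

plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT home, score FROM games WHERE season = 2023
""").fetchall()
print(plan)
# The detail column should read something like
# 'SEARCH games USING COVERING INDEX games_by_season (season=?)'.
```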
With a "proper" covering index (an INCLUDE clause in SQL Server or Postgres for example) you add data to the index value. This means it can be retrieved just by looking into the index but
- it's not constrained (e.g. to be orderable)
- it does not affect the behaviour of the index, so you can have covering data in a UNIQUE index, or in a PK constraint (although for the latter one might argue a clustered index is superior)
- it only takes space in leaf nodes, not interior nodes, so you get better occupancy of interior node pages, fewer pages to traverse during lookup, and better cache residency
- and finally the intent is clearer: when you put everything in the key it does not tell the reader what's what and why it's there, which makes it harder to evaluate changes
It's a workaround, but it bloats the interior pages of the index with the covering data, which increases the size of the index and makes lookups less efficient (they have to traverse more interior pages, and since there are more pages those are less likely to remain in cache).
I use SQLite in my personal projects, not professionally. I was wondering if you could elaborate on what you mean by 'consistency logic' in the context of SQLite.
One reason to reorder columns with SQLite is that if a column is usually null or has the default value, SQLite will not store the column at all if it is at the end of the row. It only saves a couple of bytes per column, but it is a reason to get these columns at the end.
"Missing values at the end of the record are filled in using the default value for the corresponding columns defined in the table schema."
If you have a table with 5 columns and you only insert the first 3 columns (based on create table column order) because the last 2 values are null or default, SQLite will only insert 3 type bytes in the header. However, if the first column (in create table order) is the one you omit, SQLite has to include its type byte, even if the value is null.
I think the bigger issue for many is that tooling, infra(provider), in-house knowledge/skill/experience as well as optimizations may differ quite a lot.
Of course, this will differ a lot between projects.
It is a fairly low-level abstraction, but one that does not require a verbose API. There is nothing error prone or hackish about what you have written; it will work for all inputs, it is just low level. You are just used to having other people write this code for you and give you a library. With newer versions of SQLite you could also write
`CAST(strftime('%Y', game_date) AS INTEGER)`
Which is somewhat higher level and less easily mistyped.
I agree it's less obviously correct, but I bet you could add the extension to SQLite if you feel strongly about it. As an aside, '%Y' is documented to only work in SQLite for years >= 0000 and <= 9999, so it would behave exactly the same as the code you wrote, especially because you already didn't have to worry about years less than 1000: the ISO 8601 format used for serializing dates in SQLite normalizes them with leading zeros.
For instance `select date(-50000000000, 'unixepoch');` returns `0385-07-25`.
Interestingly, %Y doesn't seem to handle negative dates either if you need to handle BC, so I guess that is one downside for both. This is one reason I sometimes prefer to use low-level code even when it is less obviously correct at a cursory glance, because abstractions may not mean what you think they mean, or even worse, may be lying to you. At least with low-level code I can reason about how it would behave under the edge cases I might care about.
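For anyone following along, a small sketch of the behaviours being discussed, run through Python's bundled SQLite (exact output can vary by SQLite version; the second result is taken from the comment above):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# The "higher level" year extraction mentioned above.
year = con.execute(
    "SELECT CAST(strftime('%Y', '2023-07-04') AS INTEGER)"
).fetchone()[0]
print(year)  # 2023

# Pre-1000 years come back zero-padded, so a substring-style approach
# would see '0385', not '385'.
print(con.execute("SELECT date(-50000000000, 'unixepoch')").fetchone()[0])
# '0385-07-25'

# Neither approach gives you BC dates: SQLite's date functions are only
# documented for 0000-01-01 through 9999-12-31.
```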
Someone I know worked at Lockheed and had every other Friday off and worked 9-5. Not quite a 4 day work week because half the smallish team needed to be on call Fridays.
A lot of government offices (and therefore their contractors) work '9 nines' to get every other Friday off, though in practice it really becomes '9 eights'.
Author here. The comparison was meant to be about how Postgres (or any client/server RDBMS) is typically deployed. Yes, you can deploy Postgres on the same machine but I wouldn't say it's common. Maybe I could have expanded more on that point or simply referenced client/server architecture rather than Postgres so it didn't seem like a straw man argument.
Author here. My goal in the comparison was only in terms of scope, not that Postgres folks should be penalized for having good documentation. I think Postgres is great and it makes sense to use it when it's called for. But I think it can be overkill for many projects.
Estimating the complexity of using a project can be really...complex. I think about systems I have used which make it easy to use a minimal set of features and where I don't have to reason about or be negatively impacted by aspects I do not benefit from, and other systems where things are less easily isolated and more challenging to reason about.
I do think the Postgres docs in particular seek to be a reference in addition to an operating manual and I for one really enjoy them. I think the point is well made that Postgres can be too much (or too much right now) for many projects.
Author here. The single-node restriction for Litestream was one of the main reasons we started LiteFS. There isn't a way to handle streaming backup from multiple nodes with Litestream & S3 as SQLite is a single-writer system and there aren't any coordination primitives available with S3.
I agree that many of the SQLite cloud offerings introduce the same network overhead. With LiteFS, the goal is to have the data on the application node so you can avoid the network latency for most requests. Writes still need to go to the primary so that's unavoidable but read requests can be served directly from the replica. The LiteFS HTTP proxy was introduced as an easy way to have LiteFS manage consistency transparently so you can get read-your-writes consistency on replicas and strict serializability on the primary. That level of consistency works for a lot of applications but if you need stronger guarantees then there's usually trade-offs to be made.
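To illustrate the idea (this is not LiteFS's actual implementation; the helper below is hypothetical and how you read the replica's applied TXID is deployment-specific), read-your-writes on a replica boils down to "remember the TXID of your last write and don't serve the read until the replica has replayed at least that far":

```python
import time

def current_txid() -> int:
    """Hypothetical helper: return the replication position (TXID) that this
    replica has applied so far."""
    raise NotImplementedError

def wait_for_txid(min_txid: int, timeout: float = 1.0) -> None:
    """Block until the replica has caught up to min_txid, or give up."""
    deadline = time.monotonic() + timeout
    while current_txid() < min_txid:
        if time.monotonic() > deadline:
            raise TimeoutError("replica is lagging behind the client's last write")
        time.sleep(0.01)

# Sketch of the request flow the proxy automates:
#  1. Writes go to the primary, which advances the TXID.
#  2. The client carries its last-seen TXID on later requests (e.g. a cookie).
#  3. Before serving a read, a replica calls wait_for_txid(client_txid),
#     which is what gives you read-your-writes on replicas.
```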
Author here. Cool to see the post make it up on HN again. I'm still as excited as ever about the SQLite space. So much great work going on from rqlite, cr-sqlite, & Turso, and we're still plugging away on LiteFS. I'm happy to answer any questions about the post.
Litestream definitely has a future. Our goal is to keep it as a simple single-node disaster recovery tool though so it won't see as much feature development as something like LiteFS. We've been focused a lot on LiteFS & LiteFS Cloud to get them in a good place but I'm looking forward to going back and updating Litestream more regularly.
Not much feature development is perfectly fine if it works! Things don't have to evolve.
Planning to use litestream as a library to dynamically swap in/out dozens of databases in a process. Looking at the code it'll easily allow that (super clean, kudos!).
So many thanks, it's going to enable a lot of new things!
SQLite has very little per-query overhead (as opposed to a database connection over a network) so I would think you could traverse a graph using multiple small queries rather than using a graph query language.
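As an illustration, a minimal BFS over an edges table that issues one small query per visited node (schema made up); with SQLite in-process, each query is roughly the cost of a function call rather than a network round trip:

```python
import sqlite3
from collections import deque

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE edges (src TEXT, dst TEXT);
    CREATE INDEX edges_by_src ON edges (src);
    INSERT INTO edges VALUES ('a','b'), ('a','c'), ('b','d'), ('c','d'), ('d','e');
""")

def reachable(con, start):
    """Breadth-first traversal issuing one small query per visited node."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for (nxt,) in con.execute("SELECT dst FROM edges WHERE src = ?", (node,)):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(reachable(con, "a"))  # {'a', 'b', 'c', 'd', 'e'}
```

SQLite's WITH RECURSIVE can also express this kind of traversal in a single statement if you'd rather keep it in SQL.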
Author here. My goal with the article was to write about a use of LiteFS that I found to be useful and to show the benefits and trade-offs. I don't think it's a general-purpose technique for everyone but I think it has its place in some cases. What was it about the post that oozed arrogance?