
The software industry seems to fetishize complexity. It's extremely harmful, yet often appealing, and I don't know how we can avoid it.


Paxos: I've been trying to look for that. Having cloned the code and grepped for paxos I'm getting no hits. Where is the paxos implementation?


It's not called out specifically (actually, when writing it I didn't even know what Paxos was, and only realized I had implemented it years later). However, the logic is here: https://github.com/Expensify/Bedrock/blob/master/sqliteclust...


With these: https://github.com/Expensify/Bedrock/blob/ecda922dc279e06fda...

Do you use the Issue Tracker to keep on top of things like that, and/or prioritise?


We're using GitHub Issues: https://github.com/Expensify/Bedrock/issues However, honestly those specific issues aren't on the list. In general we focus less on what could happen, and more on what actually does happen. Those specific issues haven't ever occurred, and thus never got "fixed" because they never became real problems. But PRs welcome!!


I'm the author of GoshawkDB.

I've just been watching the talk on this at https://www.youtube.com/watch?v=yE3eMxYJDiE. GoshawkDB has a very similar design wrt messaging and replication. In fact, in some places, it appears that GoshawkDB's design is a little simpler.

There are obviously many differences too: for example GoshawkDB runs transactions on the client, GoshawkDB uses Paxos Synod instead of 2PC, and GoshawkDB clients only connect to one server so there are 2 extra hops, but that's a constant so from a scaling pov, it should behave the same.

One of the biggest differences is GoshawkDB uses Vector Clocks (that can grow and shrink) rather than loosely synchronized clocks.

This TAPIR work does look great - I had no idea that it was ongoing. I'll read through the paper too, but it's great that GoshawkDB has so many design ideas in common.


Very interesting. I went to search for your stuff only to find I had this bookmark:

https://news.ycombinator.com/item?id=10778946

So, I knew it was interesting and possibly great work but didn't have any time to look at it. I'll move it up the backlog a bit. Maybe at least give the papers and stuff a good skimming tonight. :)

Note: Parallel, ongoing, similar work going unnoticed is a huge problem in both IT and INFOSEC. I have a meme here about how we constantly reinvent the same solutions, make the same mistakes, or move at a slower pace due to lack of connections with similar work. I keep proposing something that's like a free and open combo of the ACM or IEEE, with members who are academics, FOSS coders, pros... people that contribute at least in writing. Stash papers, code, and forums there, so the odds of serendipity increase. Thoughts?


Heh, definitely! I could ramble for hours about how there are so many disincentives to share anything until launch - the whole industry really isn't set up to actually make progress. However, multiple independent co-invention of stuff is still valuable from a validation point of view. The worst bits I see are when something from 20 years ago gets reinvented. I have emailed Irene just to see if there's anything or anyway to collaborate.


Great! It's awesome to see you taking the initiative! Might be an exception in the making.

I'm about to leave for work, but quick question. What I want to see is an open equivalent of Google's F1 RDBMS: the best one. Does yours already provide its attributes, is it more like Spanner instead, or what? Aside from CockroachDB, where is OSS on an F1 competitor?


So GoshawkDB doesn't have an SQL engine currently, so in that way it's probably not comparable with F1. GoshawkDB stores and accesses an object graph. Hopefully the howtos on the website will help people get into the mindset.

I'm not sure if it's worth trying to compare anything to Spanner or F1 because unless I'm mistaken, no one outside of Google can use F1 or Spanner - they're not released. So who knows what the performance of these things actually is? There's no way to verify the claims made in any Google paper.


"So GoshawkDB doesn't have an SQL engine currently, so in that way it's probably not comparable with F1. GoshawkDB stores and accesses an object graph. Hopefully the howtos on the website will help people get into the mindset."

Gotcha.

"I'm not sure if it's worth trying to compare anything to Spanner or F1 because unless I'm mistaken, no one outside of Google can use F1 or Spanner - they're not released. So who knows what the performance of these things actually is? There's no way to verify the claims made in any Google paper."

I think it's worthwhile for these reasons:

1. They describe it in enough detail in their papers for comparisons or maybe clones to be made.

2. That's led to competitors or open-source clones of tech before. Remember map reduce?

3. They've deployed it operationally.

4. Since when can Google not handle a backend tech it claims to be able to do? I thought their rep was solid on that stuff.

So, Google already has a strongly-consistent, scalable, multi-datacenter RDBMS with NoSQL-like performance. If it's good on enough workloads, that's the best thing I've heard in that category since NonStop SQL. The GPS thing, while a brilliant hack, might be hard for adoption - an improvement area the CockroachDB people are already targeting. A full clone or competitor could rock the DB world as a great alternative to Oracle clusters or NonStop for mission-critical workloads where some delays were tolerable.


For those interested in a bit more detail how GoshawkDB actually works, I've added a (slightly rushed) blog post on the topic, which covers transaction lifecycle. https://goshawkdb.io/blog.html#20151224


I don't think that's the problem though. The problem that I'm thinking of is that when a cluster grows in size, due to the use of consistent hashing, there'll be a set of objects that need to move between nodes. Calculating and achieving that movement is what concerns me. The exact properties are explained early on in http://arxiv.org/abs/1503.04988

I'm not expecting to ever need to model a global property of "these are the set of nodes that are up and running". I always worry about the effect of weird and wacky network issues on such systems.


By the way, I always wondered why schemes with consistent hashing maintained via a ring are a more popular approach than treating the key space as a line (-∞,+∞), maintaining an explicit map from segments and rays of keys to replica groups, and splitting a group/segment when a replica group becomes too hot or heavy.

IMHO perfect hashing solves the problem of distributing the data, but the load may follow completely different patterns. See the Instagram "Justin Bieber" problem.


Yes, you're right. Now where that paper finishes is pretty much where GoshawkDB starts. Because with GoshawkDB, the server controls the object UUId, you can push as much entropy in there as necessary so the problem identified in the paper is gone. However, ultimately what GoshawkDB ends up generating is basically a set of numbers that you interpret as a way to construct an order of the different nodes (it's the path down the tree, as described in the paper). This set of numbers you could intentionally change in order to change the load balancing to exploit locality.

So yes, currently it'll rigorously enforce a uniform balancing. However, it should be possible to move objects around if/when necessary. I doubt that'll be in for 0.2 though!


Having just had a quick look at both, the only point I'm willing to make is that ignite's distributed transactions use 2PC. https://apacheignite.readme.io/docs/transactions

2PC, at least in its usual forms, is not fool-proof: if failures happen in the "right" order then the whole txn gets blocked: it's actually proven to be equivalent to Paxos with F=0 (i.e. no ability to tolerate failure). On this point alone, GoshawkDB stands apart from anything that uses 2PC.


So ZODB has really great integration with Python: the fact that you can just subclass Persistent and get persistent objects is a really nice model.

That type of model is certainly the goal for GoshawkDB, but I accept your point that currently the Go client API offers something more similar to a document store. With GoshawkDB it really is up to the client as to what API you offer. For languages which allow full proxies of all object field getters and setters, you should be able to match ZODB, and that would certainly be the goal. But without that sort of support in the programming language, the only other way to get there is to define your object types externally and then compile special classes that do the magic for you. That too is something I'll look at for GoshawkDB: it's likely to ultimately be faster than relying on reflection and other techniques.

An awful lot of problems with these sorts of products are caused by the same terminology being used for different meanings. I'm sorry if you feel I've made this worse. Hopefully it is the case that in future releases there will be both more clients and they will offer a model which is closer to what you're expecting with the term "object store".


> With GoshawkDB it really is up to the client as to what API you offer.

So GoshawkDB is not an object database and apparently never will be. That's OK, but don't call it an object store. It's misleading, as it uses an already-established term for something different, which also already has a name.


So that I understand, what is the definition of object store?


https://en.wikipedia.org/wiki/Object_database

Mind you, in several programming languages objects can have instance-specific methods (Ruby and Python being notable examples). This alone makes object more than merely a sum of a predefined structure having data fields and (also predefined) schema having functions to operate on those fields.


Thanks for the link. Whilst I'm not arguing with your point, I believe I've never used the term "object database" to describe GoshawkDB, only "object store". I guess I'm struggling to find a more accurate term than "object store" to describe GoshawkDB. I don't like "document store", as to me "document" tends to be something human readable.


I understand your reluctance to use the term "document". I don't like it much either when it comes to JSON entities, but I can't find any better term, much less a widely used one.

Look at the matter this way: CouchDB, MongoDB, and ElasticSearch all use the term "document" to describe a hash containing data in a JSON-compatible model. They already do so, and they are quite well known in the field. It only makes sense to follow their lead, so your potential users can recognize the capabilities of your application more easily.


Excellent points well made :) I think I may well change the description then as you suggest, and once I have clients that support the same API model as things like ZODB then I'd describe it as something like "document store that can also be accessed like an object store" (well, hopefully not that long winded...).


Not quite: Paxos Synod is merely used as the replacement for 2PC, as 2PC does not work. So that's just achieving consensus on the txn votes for each node. It's actually vector clocks (and improvements thereof) that manage the dependencies between txns. The novel bit of vector clocks in GoshawkDB is getting them to expand and contract as necessary, rather than just becoming enormous, which happens with some systems.


Hi, I'm the author of GoshawkDB. Thanks for your questions - they're certainly not answered on the website so I'll do my best to answer them here and incorporate them into the website later.

> Curious how this works at scale. For example if a node is down and N requests start blocking, is there an upper bound to N? What happens when this is reached? Does N map to a goroutine? A file descriptor? Is there timeouts?

The blocking actually occurs on a per-client basis: when a client submits a txn to the server to which it's connected, that server will calculate which other server nodes need to be contacted in order for the txn to be voted on. If it cannot find enough servers due to failures then it will outright reject the txn and tell the client that there are currently too many failures going on.

However, the server could find the right set of server nodes and send off the txn to be voted on, verified, and validated. Right at that point there could be further failures, and so it's not known how far those messages got. It is in this case that the txn could block, as until some nodes recover it's impossible to safely abort the txn. But it's important to note that:

1. If a txn starts blocking, that'll block just that txn and that client (as the client is blocked waiting on the outcome of the txn). No other part of the server or any other server is affected.

2. Other clients can be submitting other txns at the same time that need different un-failed nodes to proceed, and they progress just fine.

3. There is no global determination (yet) of whether a node is failed or not. Thus this approach can cope with really weird network issues - eg random connections between different nodes failing. There is no need for the whole remaining cluster to agree that nodes X, Y and Z are "out" - none of the algorithms depend on that sort of thing.

I hope this answers your questions.


It helps but doesn't quite answer my question yet.

With your situation #1, what if this is a very common transaction and therefore you have 100 of these all waiting? What about 1000, 5000, etc.? What system resources are used to let these transactions wait indefinitely (if I understand your semantics with specific regard to blocking)?

Some systems handle this as a failure that is communicated to the client rapidly. Other systems let N clients actually wait indefinitely, but at the cost of taking up a thread / file descriptor, etc. In systems that have a finite number of threads, for example, this would then be communicated in this paradigm as an upper bound on how many requests one could have waiting.

So I'm just trying to get a feeling for how this could have an infinite amount of waiting transactions due to partial failure and still keep taking requests.

Thanks for the reply, this stuff is always interesting


> With your situation #1, what if this is a very common transaction and therefore you have 100 of these all waiting? What about 1000, 5000, etc.? What system resources are used to let these transactions wait indefinitely (if I understand your semantics with specific regard to blocking)?

So, as each client can only have 1 outstanding txn at a time, this requires that there are 100, 1000, 5000 etc. clients all submitting the same txn at the same time, and presumably they're all connected to the same server node; and for each of those txns, the set of server nodes that need to be contacted has been calculated, and the necessary messages sent. At that point failures occur, so it's not known who has received which messages, and so all these txns block.

The only system resource held at this point is RAM. Yes, it could get so bad that you eat up a lot of RAM - this is release 0.1, after all. There are a few areas where I have tickets open to make sure that GoshawkDB can spill this sort of state to disk as necessary, though it may not be too horrible just relying on swap to start with.

> Some systems handle this as a failure that is communicated to the client rapidly. Other systems let N clients actually wait indefinitely, but at the cost of taking up a thread / file descriptor, etc. In systems that have a finite number of threads, for example, this would then be communicated in this paradigm as an upper bound on how many requests one could have waiting.

The problem is that in this particular case, the txn could have committed; it's just that the server which initially received the txn from the client and then forwarded it to the necessary server nodes can't learn that it committed, due to failures. Now certainly I could inform the client that the outcome is delayed, but the client may not be able to do anything with that information: it can't know for sure whether the txn has been committed or aborted.

The entire codebase is written using actors and finite state machines. Because Go supports green threads and goroutines can be pre-emptively suspended, there is no problem with eating OS threads. In general, the design is such that the only things that should block are the client and, on the server, the actor/state-machine that is talking to that client.


Thanks, that answers my question.

Considering that more than one exact txn will, I imagine, often hit a single specific node, at large scale with a single node down, even if that means only 5% of transactions block, you are basically growing a queue of waiting work indefinitely, with the only upper bound being how much RAM you have. Meanwhile 5% of clients will be waiting, and this node may take a while to come back if it needs something.

Once you're out of RAM or other constrained resources, the 5% of the system that is not functioning turns into 100% real fast, because your capacity to handle the other 95% takes RAM or other resources you now have dedicated to waiting.

If what you're saying is that each client can only have one blocked transaction, that is relevant, but it doesn't prevent a consumer from spinning up as many clients in succession as it takes to get through.

I would suggest that you have, at minimum, strict timeouts for the transactions, in conjunction with an immediate fail if the node with the answer is not available right now. So a client would never wait more than X seconds before the transaction is aborted and, if necessary, rolled back.

What this would create is a system with predictable failure cases when things go pear shaped. You could calculate in advance how much overhead it would add when something goes wrong as you can have a determinant time factor when having clients wait instead of indefinitely.

Furthermore, what if a node never comes back? There seems to be a need for a transaction failure that is handed back to the client, whether its node is down, gone forever, or simply timed out.

At the end of the day, even if your system is able to handle N thousands of transactions patiently waiting and can still answer other requests indefinitely, that is a great accomplishment, but in practice it may not be ideal for many workloads. People and computers both tend to retry interactions with data that are slow or failed, and the combination of just taking on more and more work while hoping everything flushes out when things become healthy is a recipe for a thundering herd, better served by something like an async work-queue pattern.

Btw I say this w/o looking at your code , just home page, so possible these things exist and it's not clear what the bounds and failure cases are yet.

Keep on hacking on it!


> Considering that more than one exact txn will, I imagine, often hit a single specific node, at large scale with a single node down, even if that means only 5% of transactions block, you are basically growing a queue of waiting work indefinitely, with the only upper bound being how much RAM you have. Meanwhile 5% of clients will be waiting, and this node may take a while to come back if it needs something.

Ahh, no! For each obj, there are 2F+1 replicas, each replica on a different node. For each txn that hits an obj, you only need F+1 of those replicas to vote on the txn. So, provided F > 0, a single failure will never cause anything to block.

> I would suggest that you have, at minimum, strict timeouts for the transactions, in conjunction with an immediate fail if the node with the answer is not available right now. So a client would never wait more than X seconds before the transaction is aborted and, if necessary, rolled back.

I agree. I think it would be very difficult for a client to do anything sensible with such information, but even if all I'm doing is getting the client to resubmit the txn verbatim, at least it clears up the resource usage on the server, which is the most important thing.

> Furthermore, what if a node never comes back? There seems to be a need for a transaction failure that is handed back to the client, whether its node is down, gone forever, or simply timed out.

Well, this only becomes a problem if > F nodes fail and never come back - the whole design of consensus systems is to cope with failures up to a certain threshold. Provided <= F nodes fail, the failures are detected and any txns that are in-flight are safely aborted (or, if one actually committed, that information is propagated) - this is all just usual Paxos stuff. But yes, again, I completely agree: if you have a massive failure and you lose data, then you are going to have to recover from that. For GoshawkDB, that's going to require changing topology, which is not supported in 0.1 but is the main goal for 0.2.

> At the end of the day, even if your system is able to handle N thousands of transactions patiently waiting and can still answer other requests indefinitely, that is a great accomplishment, but in practice it may not be ideal for many workloads. People and computers both tend to retry interactions with data that are slow or failed, and the combination of just taking on more and more work while hoping everything flushes out when things become healthy is a recipe for a thundering herd, better served by something like an async work-queue pattern.

Oh absolutely. In a previous life I did much of the core engineering on RabbitMQ. It was there that I slowly learnt that the chief problem tends to be that under heavy load you end up spending more CPU per event than under light load, so as soon as you go past a certain tipping point it's very difficult to come back. I certainly appreciate that human interaction with a data store is going to require consistent and predictable behaviour.

Thanks for your input.


Cool, makes more sense now.

At a high level it would be interesting to make it as durable as possible in situations where say F=2 but you have 30 nodes.

As I understand it, in this hypothetical case, the majority of the system would work fine if 3 nodes fail, but there could also be some percentage of transactions which are unfulfillable due to this scenario while there should be many that can still be answered with confidence. ( if I'm understanding correctly that data does not go into every node with a configuration like this and is consistently hashed across )

So was just curious how a prolonged period of this may turn into an unbounded problem when trying to digest your page on it.

Ultimately it sounds like it's headed in a good path and embracing this partial-fail-but-that-doesn't-mean-show-is-over with specifics is a good thing to get ironed out and would certainly be a key thing to get right early, even if it's largely semantics and client behavior.

Good luck, will try to play with it soon


> As I understand it, in this hypothetical case, the majority of the system would work fine if 3 nodes fail, but there could also be some percentage of transactions which are unfulfillable due to this scenario while there should be many that can still be answered with confidence. ( if I'm understanding correctly that data does not go into every node with a configuration like this and is consistently hashed across )

Yes, you understand it perfectly. :)

> Ultimately it sounds like it's headed in a good path and embracing this partial-fail-but-that-doesn't-mean-show-is-over with specifics is a good thing to get ironed out and would certainly be a key thing to get right early, even if it's largely semantics and client behavior.

Indeed. Semantics are very important and I want to make sure GoshawkDB is easy to understand and reason about at all times.


Hi, I'm the author of GoshawkDB.

I'm no Cassandra expert, but I'm certainly under the impression that Cassandra does not offer full transactions - only lightweight txns, which are atomic only for a single row.

