
Pastry: A distributed hash table in Go - paddyforan
http://secondbit.org/blog/introducing-pastry/
======
jeremyjh
> We couldn’t seem to find any way to subscribe to events and publish events
without a single point of failure.

Really? You think this doesn't exist? In fact, this is nothing but a
_deployment_ concern for any mature message broker.

<http://www.rabbitmq.com/ha.html>

<http://activemq.apache.org/masterslave.html>

<http://hornetq.sourceforge.net/docs/hornetq-2.0.0.GA/user-manual/en/html/ha.html>

There are lots more options than these, and you can also use Heartbeat/LVS to
take something like Redis and make it HA.

I'm glad you had fun inventing your own distributed hash infrastructure, but
please do not attempt to convince other people that there are no other options
out there for reliable and highly available messaging.

~~~
paddyforan
Er. This seems a little overly hostile for something that amounted to "Hey,
here's this fun thing I made. Anyone else want to play with it?"

Rather than arguing this point again, I'll just point at the other comment
thread that addressed these concerns.

Fun fact: I didn't mention highly available messaging in the post at all. I
mentioned single points of failure and bottlenecks. Those aren't quite the
same.

~~~
jeremyjh
A highly available system has no single point of failure.

~~~
paddyforan
It can, however, have bottlenecks.

All of your solutions, based on the conversation below, separate the messaging
into its own service. Which isn't really how I want to work with my software.
I guess the best way I can describe it is as a different aesthetic. There are
certainly ways to accomplish the same thing, technologically, but they all
feel wrong to me. Introducing a message queue, with a broker, and needing to
implement HA on that simply to remove that single point of failure... That
feels like a workaround to me. I prefer working with a decentralised pub/sub
framework. Personally, that difference is worth the development time and lack
of maturity. To me.

Which is why I slapped an alpha label on the project. Why I made no attempt to
convince people they should switch over, or even that my way was objectively
better than any other way.

I had an itch. I scratched it. I released the backscratcher I used as open
source so other people with the same itch could scratch it. I'm not sure why
some people think this is a contemptible act.

~~~
jeremyjh
Bottlenecks can also be managed with various design and deployment strategies.
There has actually been a lot of thought and time put into some of these
products and they do have to solve the same problems you are concerned with.

Like I said, I'm glad you had fun with it. But you stated in your article
that there are no existing solutions without a single point of failure, and
that simply isn't true. That's the only part I object to.

~~~
paddyforan
I think we may have to agree to disagree here. You do not like the way I
expressed my dissatisfaction with the previously available solutions, and I do
not like the way you expressed dissatisfaction with my attempt to explain my
predicament.

I stand by what I wrote, as it is the best way I can explain the, frankly,
hard to describe issue I take with separating the messaging component into its
own servers. If I had to rewrite it, I don't know how I could explain it any
differently and get closer to accurately expressing myself. I'm sincerely
sorry if you feel I've misled you or am misleading people; it is not my
intention in the slightest. I can't imagine what I could conceivably stand to
gain from such a deception.

I would, however, suggest that if people are taking their tech advice from me
without doing any research of their own (which is what you seem to be
suggesting, as you've commented several times that the information that you
believe contradicts me is easily available), perhaps my inability to properly
express my reservations with the previously available solutions is not the
biggest issue at play here?

------
Terretta
> _We couldn’t seem to find any way to subscribe to events and publish events
> without a single point of failure. We looked at everything we could find,
> and they all seemed to have this bottleneck._

“The Spread toolkit provides a high performance messaging service that is
resilient to faults across local and wide area networks. Spread functions as a
unified message bus for distributed applications, and provides highly tuned
application-level multicast, group communication, and point to point support.
Spread services range from reliable messaging to fully ordered messages with
delivery guarantees.”

<http://www.spread.org/>

We use this for LAN event communication among standalone servers acting
together, and WAN event communication among POPs acting together, for billions
of events per month. Any server in a group dies, they elect a new master, so
no SPoF.

What was it about your use case that made this feel like a single point of
failure or bottleneck?

~~~
paddyforan
This sounds like something I would have loved to come across. It didn't come
up in any Google searches or conversations with the other software people I
talked to about this problem, so I didn't evaluate it. Quite simply, I had
no idea it even existed.

------
ukd1
Cool idea, but I wonder why they didn't use something like RabbitMQ, which
already exists and is proven?

~~~
paddyforan
I looked at it, I promise I did. Really wanted to avoid writing this if I
could.

From what I can see in the docs, RabbitMQ is a client/server relationship.
Meaning there is a server. Meaning a single point of failure and a bottleneck.
I hate those more than I hate writing new software. Not because I think there
is less danger of bugs in this than there is danger that the AMQP server will
fall over... it just feels wrong to me.

I'm not saying, by any stretch of the imagination, somebody should use this
over RabbitMQ. But if a distributed hash table is more suited to their needs,
or if, like me, they're allergic to SPoFs and don't mind if the software is a
little less mature if it means avoiding them... Well, now they have a package
in Go they can use. If they want to.

~~~
JulianMorrison
0mq doesn't have a client/server relationship. <http://www.zeromq.org/>

~~~
paddyforan
I believe I looked at that as well. It was six months ago, so forgive me if my
memory is a bit off.

Looking at the intro, it looks like I have to hardcode the IPs and ports of
all of the machines I want to publish or subscribe to. Is that correct? If so,
that's a little fragile for my tastes. I like "upgrading servers" by standing
up new servers, testing they work, and then changing the floating IP
associated with the DNS to point to them instead.

And I can hear you say "Aha! Floating IP! Just use that!" but for most cloud
providers (I believe), that is billed as external bandwidth, which is not free
like internal requests are. Finally, if I want to stand up more servers to
scale horizontally, I'd still have to modify my code and redeploy it to all my
servers. I think. Unless I'm missing something about how 0mq works. Which I
totally could be.

On the whole, learning what I needed to learn to make Pastry happen made me a
better programmer, too. So if for no other reason than that, I'm glad I did
it.

~~~
JulianMorrison
No more so than with TCP sockets, where you have to "hard code" the IP you
connect to.

0mq is low level, it's not going to do routing like Pastry, but you can get
away with having one fixed location for a "lookup server" that keeps track of
everything else's location (something like the Doozer project, also in Go
<https://github.com/ha/doozerd> ).

Also, 0mq only needs to know the IP of the publisher, for pub-sub.

~~~
paddyforan
Very cool. Where were you six months ago? ;)

I was familiar with doozer, but doozer wasn't(/isn't) being actively
maintained and it doesn't compile against the latest version of Go, so I'd
have to bring it up to speed before I could use it anyway. Not saying it
would be _harder_, but between that and writing a 0mq library (pretty sure
one does not exist in Go yet), I'd estimate the work would be more or less
equivalent and yield just about the same stability. Seat of the pants guess,
but it makes me feel better, at least.

~~~
JulianMorrison
Heh, thanks.

The mailing list suggests <http://github.com/4ad/doozer> and
<http://github.com/4ad/doozerd> are being maintained as forks of the original.
Or there's Zookeeper. Or brew your own central lookup server using 0mq that
does the same job of "set a key, get a key, notify subscribers when it
changes". It's a single point of failure, but since the system will run fine
without it (only lacking topology updates) it's not a hugely problematic one.

~~~
paddyforan
This reply link just appeared for me. Not cool -_-

The lookup server is a single point of failure. And while it's not a hugely
problematic one, it also is a dedicated machine whose sole purpose is keeping
the other machines running. Which feels wrong to me.

It probably wasn't the best business decision to invest time in this, and I
won't even argue this is the best technical solution. But it's the only
technical solution that didn't make me feel like I was working around
limitations; things just worked the way they were supposed to. The API
servers received an event, they told the WebSockets servers about it. It felt
very conceptually pure to me. I'm a sucker for that.

~~~
JulianMorrison
To get the reply on a deep comment, click "link" first to get the comment as a
single page, and "reply" will be there.

Doozer is only slightly a SPoF; it has high availability via multi-master
replication. It's also not doing a lot of communicating, or a lot of CPU work,
and it keeps its data in RAM, so it may be OK for it to live on a
non-dedicated machine.

Sorry if I come across as criticizing your admittedly cool creation. It was
just that you said there weren't alternatives, and I knew of one.

~~~
paddyforan
No, I definitely appreciate the conversation, because I looked for forever for
a pre-built solution to this, and couldn't figure out how nobody had needed a
solution before now. So I'm glad there are solutions, and I'm not just crazy.

------
Saavedro
Be aware of possible confusion with
<http://en.wikipedia.org/wiki/Pastry_(DHT)>

~~~
paddyforan
Confusion is intentional. This is just an implementation of that algorithm,
with a few tweaks (that are explained, along with their reasoning, in the
README). I believe I even link to that Wikipedia page in the blog post and the
README.

~~~
StavrosK
I didn't see anything in either the Wikipedia page or the code, but how do you
achieve redundancy? Wikipedia says nodes can die with minimal or no loss of
data, but I see everywhere that each message only goes to one node. Which is
the case, and how is this achieved?

~~~
paddyforan
They both are the case. Well, sort of. The messages can go to any number of
nodes as part of the routing, but they're only considered "delivered" at one
node. However, nodes can die with minimal loss of data.

The problem is you're mistaking the messages for data. As far as Pastry is
concerned, your messages are not data that it should retain. They are one-off
communications. The only data that Pastry is concerned with storing is the
state of the cluster; that is to say, Pastry only considers "data" the
information about enough nodes to reliably be able to route a message to any
node in the cluster. Your messages, by default, are lost the moment they are
sent, forwarded, or delivered. The node stops caring about them at that point.

However, on top of this framework you can build applications that _do_ care
about the data in the messages. Before a message is forwarded, it fires off a
callback to any application that cares to listen for it. When a message is
delivered, it fires off a callback to any application that cares to listen to
it. You can register callbacks that store this information, then build your
own redundancy frameworks on top of that to store the data you'd like to
retain.

If you're interested in this, I'd recommend the paper on PAST. It's a high
availability storage system built on Pastry:
<http://research.microsoft.com/en-us/um/people/antr/PAST/hotos.pdf>

~~~
StavrosK
Ah, I must have confused this for a storage algorithm, rather than a message
passing algorithm. Doesn't "distributed hash table" refer to, well, a
distributed hash table? One that stores key/value pairs? What good is it if
the values are lost right after going in?

~~~
paddyforan
That is what a distributed hash table is. And the values and keys aren't
lost--there's just some confusion over what the keys and values are.

The keys and values, at least in Pastry's case, are the NodeID and the
metadata about the Node, respectively. According to my understanding, at
least. I had to learn a lot to implement this, so I would not claim to be an
expert on distributed hash tables by any stretch of the imagination.

~~~
StavrosK
Hmm, so is the use case a distributed task queue? If so, why not use a
consistent hash ring
(<http://www.martinbroadhurst.com/Consistent-Hash-Ring.html>), or just pick a
server at random? I'm afraid I'm missing the point
entirely :/ Is it about maintaining a consistent list of member workers?

~~~
paddyforan
A consistent hash ring is a similar concept, I think. We're quickly
approaching the out-of-my-depth line, though.

It's about maintaining the property that you can route information from many
different places and have that information end up in a consistent place, no
matter where it's sent from. And, assuming the consistent place doesn't go
offline, a large number of servers can fall over without warning before that
property is lost. It's about making discovery of nodes in your cluster
reliable, and passing information between them.

The use cases for that are many, and each has its own nuances, which is why
this is so confusing and hard to talk about.

~~~
StavrosK
Right. It sounds like one of those problems that can be solved in many ways,
and a DHT is designed to deal with a specific constraint. Thanks for the
details!

------
matticakes
The focal point is discovery, not other queues (or other libraries that can
build queues). This is an interesting way to approach it; thanks for open
sourcing it.

We chose to solve the discovery problem a bit differently in NSQ
(<https://github.com/bitly/nsq>) but I could certainly see some interesting
opportunities to experiment with a distributed hash table approach as well.

~~~
paddyforan
I was very excited when I saw NSQ. I thought it would solve my problem
splendidly. It seemed, from my reading of the docs, that it did not support
multicast, however. :(

I was very impressed with it, though. You guys did an awesome job, and I'm
extremely flattered you stopped by to comment.

~~~
matticakes
It's a bit more explicit in NSQ but we essentially support multicast-like
routing through "channels".

A "channel" receives a copy (at the source) of all the messages for a "topic"
and has its own set of clients subscribed.

~~~
paddyforan
Hm. Very interesting, I'll have to take another look and get a better
understanding of it.

------
realrocker
oh quiet you naysayers. just nod in appreciation of the hard work.

~~~
cnlwsu
Interesting that the same group of people who were proponents and pioneers of
NoSQL are now dismissing libraries and new implementations of "solved"
problems with "you can do that with {blank}".

------
joelthelion
Hijacking the topic on P2P: is there a good library (any language will do) for
P2P message passing (as opposed to storing information in a DHT)?

I'd like to experiment with a decentralized Twitter/Reddit-like system using
P2P message flooding and machine learning to weed out spam.

~~~
paddyforan
<http://tent.io/> ?

~~~
joelthelion
BTW, do you know of any real projects using Tent? It does sound interesting.

Going back to P2P: building a flooding P2P library on top of a DHT seems like
a total waste. But maybe the common parts could be extracted.

~~~
graue
Yup. There's a microblogging app (Twitter clone) called TentStatus, available
hosted at <https://tent.is>, or run your own instance of the code:
<https://github.com/tent/tent-status>. Several people are working on iOS and
Android apps to support the microblogging use case.

There's Essayist, a (very alpha) long-form blogging app, hosted at
<http://essayist.mndj.me>, code at <https://github.com/mwanji/essayist>.

Meanwhile, the inventors of Tent have 5 people working 60+ hours a week
running Tent.is and developing the protocol and apps for it. They haven't
revealed what's next, though it sounds like some non-status apps are in the
works.

------
spullara
They were already using Redis and made this? They could have just partitioned
and replicated it for scaling.

~~~
paddyforan
Yup. Totally could have made a choice I wasn't happy with.

Opted to make a choice I was happy with instead.

------
igrekel
Next step: a tuplespace in Go?

------
drivebyacct2
Oh man, I was really, really just hoping for this. I've been avoiding putting
one last piece into my server because I didn't want to use redis. I'm going to
play with this in a couple hours.

~~~
paddyforan
Please do let me know how this works out for you.

And I mention it in the blog post, and I mention it in the README, but I feel
I should reinforce this:

This. Is. Alpha. Please don't use it in your mission-critical code.

I'd like to run a lot more tests on it. I'd like other people to run tests on
it and let me know what they think. This is my first time writing anything
this complex, and I want any and all feedback I can get on it.

Thanks!

~~~
JulianMorrison
How are you planning to implement pub-sub on top of it?

~~~
paddyforan
I'll be using the Scribe algorithm, as defined by the authors of the Pastry
algorithm: <http://en.wikipedia.org/wiki/Pastry_(DHT)#SCRIBE>

The basic gist of it is similar to a phone tree. You route a message
subscribing to a topic, basing the message ID on the topic name. As the
message routes through the network, each node stores information about who
is subscribed. When an event happens, it is passed back down the tree to the
subscribed nodes.

I may have to tweak it a little, as I did Pastry, but I'm pretty sure the same
basic concept will work for my needs.

