It's quite common on some social networks for pages/communities/accounts to have several thousand or even millions of subscribers.
Is it really wise to build a publish-subscribe delivery system on top of HTTP? This seems to be a huge overhead.
Meanwhile, XMPP has been offering similar features (XEP-0060: Publish-Subscribe, https://xmpp.org/extensions/xep-0060.html) for more than 10 years.
It's implemented in several servers and can handle huge loads without problems (everything is handled in real time through encrypted TCP sockets across the network).
We have been building social networks on top of XMPP for several years now; you can check Movim (https://movim.eu) and Salut à Toi (https://salut-a-toi.org/) :)
I'm sure there are some use cases where this makes sense, but I agree with you. Probably most people wanting to do large-scale pubsub should just be using XMPP, or possibly something like MQTT.
And I agree, it should use something like the Noise protocol instead of HTTPS.
Mind you, XMPP isn't all that efficient either, since it's all based on XML.
For the kind of content sent over WebSub (generally an Atom or RSS feed with one or more long messages), the framing overhead is much less than 50%.
In the embedded world a JSON/XML parser eats a tonne of resources.
One could of course use it over satcomm as it says, but it's hilariously expensive when you are paying by the byte. Then again, compared to the massive JSON goop with embedded pictures that Twitter uses, it's a paragon of speed.
On the other hand, HTTP is a simple protocol that is synchronous in nature.
I'm only just learning about WebSub tonight, but it looks like a lean, efficient, and fairly minimal protocol to me. What gives you the impression that there will be huge overhead - could you be more specific?
When new content is published to a topic in WebSub, it's delivered with an HTTP POST that will look something like this:
POST / HTTP/1.1
Link: <https://hub.example.com/>; rel="hub"
Link: <http://example.com/feed>; rel="self"

(plus the usual Host and Content-Type headers, followed by the updated feed content as the request body)
Furthermore, it looks like these messages can be sent using HTTP/2, if client and service support it (which is something you'd prioritize for cases where efficiency matters). HTTP/2 is a binary protocol and takes advantage of HPACK header compression (RFC 7541). This means that if the same header appears in multiple requests, it will be transmitted very efficiently. Thus WebSub headers that are likely to be the same for all requests across a connection (like Host, Content-Type, and Link) will be transmitted virtually for free.
Even the vanilla HTTP/1.1 request described above seems reasonable though -- certainly not something that strikes me as a cost or efficiency problem -- and with HTTP/2 the framing is probably not much larger than the content payload itself.
Now let's compare to XMPP PubSub. From looking at XEP-0060, an item published over that protocol looks like the following - based on Example 101 in: https://xmpp.org/extensions/xep-0060.html#publisher-publish
<iq type='set' from='hamlet@denmark.lit/blogbot' to='pubsub.shakespeare.lit' id='publish1'>
  <pubsub xmlns='http://jabber.org/protocol/pubsub'>
    <publish node='princely_musings'>
      <item>
        <body>say hi to mom</body>
      </item>
    </publish>
  </pubsub>
</iq>
Based on this naive comparison, I don't see a reason to conclude that WebSub will have more overhead than XMPP PubSub. When implemented over HTTP/2 it may be more efficient.
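A rough way to sanity-check this comparison is to just count the framing bytes. The header set and stanza below are illustrative assumptions, not taken verbatim from either spec:

```python
# Back-of-the-envelope framing overhead: an uncompressed HTTP/1.1 WebSub
# delivery vs. an XMPP pubsub event envelope (both illustrative).
websub_framing = (
    'POST / HTTP/1.1\r\n'
    'Host: subscriber.example.net\r\n'
    'Content-Type: application/atom+xml\r\n'
    'Link: <https://hub.example.com/>; rel="hub"\r\n'
    'Link: <http://example.com/feed>; rel="self"\r\n'
    '\r\n'
)
xmpp_framing = (
    "<message from='pubsub.example.com' to='subscriber@example.net'>"
    "<event xmlns='http://jabber.org/protocol/pubsub#event'>"
    "<items node='http://example.com/feed'><item id='1'>"
    "</item></items></event></message>"
)
# Both envelopes land in the same ballpark (a couple hundred bytes),
# before HTTP/2 header compression shrinks the WebSub side further.
print(len(websub_framing), len(xmpp_framing))
```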
That is also the power of pubsub: it gives you the freedom to put whatever you want in it (it can be Atom posts like in your example, but also stock market ticks pushed every 5 seconds, server monitoring logs...). You define your own namespace, write a little parser for it, and use it with your XMPP pubsub library :)
Seems to be working fine for SQS. It all depends on your use-case. For high volume messaging or certain types of messages you might reject WebSub for the same reasons you might reject SQS in favour of AMQP or MQTT etc.
> WebSub was previously known as PubSubHubbub.
What I am curious about are the following questions:
1) What differentiates WebSub from XMPP?
2) What differentiates WebSub from ActivityPub?
3) How are you handling the N-squared delivery problem, if you are delivering content directly to each subscriber with HTTP POSTs?
4) Does WebSub currently support store and forward? If not, is that on a roadmap for a future version?
5) Same as 4, except for support for forms and form responses? Examples are a builtin Yes or No reply, or a builtin poll vote.
6) Same as 4, for automated message routing.
7) Why not make it transport-agnostic, instead of mandating HTTP? And why HTTP? The growing trend is towards more decentralization.
8) How does this compare to Sir Tim Berners-Lee's SOLID (https://solid.mit.edu/)?
Edit: I don't understand the downvote. I use RSS daily to fetch news and I don't have any problems with it. It's simple and delivers news to the edge (my mobile phone). I'm not saying WebSub is useless; I would like to understand what a three-entity model brings to the table compared to simple server-client delivery. What's more, the Subscriber entity cannot be a mobile device on current networks, because mobile internet providers block incoming connections. Therefore, to fetch news, it has to be a pull model.
Why not add a simple rationale at the top of the spec, explaining the problem, the existing solutions, and why this new solution? Does it solve a security issue, a scalability issue, or a trust issue?
- publisher would send updates, instead of having everyone poll
- publisher would be protected from thundering herd if a content suddenly becomes popular
- publisher and subscribers wouldn't need to exchange a full "page" of items when only one is needed
So far, no one on this entire thread has described a single use case where WebSub actually provides value. Someone mentioned that a service like Facebook could theoretically use this, but then they themselves linked to a Quora page with an explanation from someone at Facebook of why they tested PubSubHubbub and then gave up on it, stating that he "think[s] the benefits of adopting PubSubHubbub are less clear" for this very reason (it is inherently a server-to-server protocol, and there wasn't really a problem to solve in the first place).
As we are trying to explain: this proposal doesn't cover all the use cases of RSS+GET pull, so it should not be considered a replacement for RSS, nor an enhancement of it.
If WebSub has a purpose, many people are failing to explain a valid one. Is it worth spending the W3C's resources on such a proposal?
The "world wide web" is wide, and doesn't only include servers
That's the root of our misunderstanding then: never was PubSubHubbub _replacing_ RSS+GET pull, it was supposed to _help_ it for people who want notifications, on top of the existing system.
RSS is a ~500kb static text file which is usually updated at most every hour. The whole file fits in server RAM and can already be served to thousands of people from cache with minimal CPU usage, or from content delivery networks. A "full page of items" is easily compressed with gzip.
Surely, the Internet isn't congested because of RSS...
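The gzip point is easy to check: repetitive feed markup compresses to a small fraction of its size. A quick sketch with a deliberately repetitive fake feed (the content is made up for illustration):

```python
import gzip

# A deliberately repetitive fake feed: 200 near-identical items.
item = "<item><title>Post</title><link>https://blog.example/post</link></item>\n"
feed = (item * 200).encode()

compressed = gzip.compress(feed)
ratio = len(compressed) / len(feed)  # well under 10% for content like this
print(f"{len(feed)} bytes -> {len(compressed)} bytes ({ratio:.1%})")
```

Real feeds are less uniform than this, but Atom/RSS markup still compresses very well.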
Publisher still has to send the update to the thundering herd, no?
The thundering herd happens when you have an uncontrollable influx of incoming traffic and no way to regulate it.
For small self-hosted blogs, authors can pay for a CDN that will absorb any exceptional load and cost almost nothing the rest of the time. At worst, the blog will go down for 24 hours and come back online when things calm down.
The idea of creating a hub that will be a central point is the opposite of distributed information.
Again, a totally random fact without a single source to prove it.
So WebSub basically comes from the needs of that one enterprise? Is that worth making it a W3C specification?
However, it now seems WebSub was created to add a middleman that will a) read all your activity and b) take a transaction fee on every update?
This is worth pointing out directly: Yes, of course. You always do have to transmit the data to the actual reader. But those are users of the feedreader, they do not contact the source at all.
The advantage of this scheme is that data is transported exactly as often as necessary and as fast as possible, unlike when data is polled.
I see no relation to a CDN. A CDN would mean the source doesn't have to handle the incoming traffic, or handles less of it, but a CDN neither enables push nor reduces the traffic between the CDN and the feed-reader server.
Again, I'm not too deep into WebSub, but I know the predecessor quite well.
This spec solves a specific problem. You have one URL resource many clients are interested in, that URL gets occasional updates, and many clients want to fetch the new content. Think RSS feeds. So far, every client had to poll, i.e. look at the file again and again to see whether there is a new update. Think a bit about that and you see how hard that is to do on the client side (comparing content, making sure new stuff is really new and not just reordered, storing the old file, etc.). Normally, those clients are servers. That's really important to understand; without that knowledge WebSub makes no sense.
Push-based protocols solve that problem with a middleman, the hub. Whenever there is an update, the original server sends one single POST to the hub, and the hub then sends that notification to all subscribers. And whoosh, no more polling.
The end result is far less traffic on the wire, a much simpler architecture on the feed-reader side (it need not be built on polling infrastructure; instead it has one webhook open, and when it gets notified of an update it fetches the source once), and less server load on the origin. For blogs, for example, that really is relevant.
This has nothing to do with a CDN.
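The hub's core fan-out step really is that simple. A minimal sketch (the function name and retry policy are assumptions; a real hub would also verify subscriptions and sign deliveries):

```python
import urllib.request

def notify_subscribers(callbacks, content, content_type="application/atom+xml",
                       send=None):
    """POST the updated content to every subscriber callback URL.

    Returns the callbacks that failed. `send` exists so tests (or a work
    queue) can replace the real network call.
    """
    if send is None:
        send = lambda req: urllib.request.urlopen(req, timeout=10)
    failed = []
    for callback in callbacks:
        req = urllib.request.Request(
            callback, data=content,
            headers={"Content-Type": content_type}, method="POST")
        try:
            send(req)
        except OSError:
            failed.append(callback)  # a real hub would queue and retry these
    return failed
```

The publisher only ever makes one request (its ping to the hub); this loop is the work the hub takes off the publisher's hands.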
- With today's technology, you don't need an intermediate online third-party feed reader. Devices can pull data from the sources.
- Not only can devices pull the data, it's often the only way to get data, since mobile operators block any incoming HTTP traffic for security reasons.
- People don't need to read 1000 pieces of news in real time.
- With a source and an RSS reader, there is NO HUB at all. How convenient.
- For the relatively-static-content delivery problem, CDNs already exist, and caching has existed for decades and is already part of the base HTTP protocol.
- The only thing this protocol enables so far is middlemen that will try to monopolize and control the news ecosystem, while news delivery is already free (for the end user).
Bonus point: the traffic of text on the internet is ridiculously small compared to video and other data. That argument is not a valid one for introducing another third party.
At some point you have to go outside of the HTTP world and look for what already exists rather than reinventing the wheel.
Does it solve news delivery? RSS does that already.
Does it solve instant notifications and presence? Lower-level protocols already do that with less overhead.
- This would not change
- WebSub has nothing to do with mobile devices
- That is not up to you to decide, and the servers powering online feed readers or platforms like Facebook do have a lot of data sources
- That's not a point? For the type of RSS readers you are thinking about, nothing would change.
- A CDN has nothing to do with this.
- No. It is a decentralized/federated protocol, no middle man can do that.
Your bonus point is invalid: the 3rd party is already there (PubSubHubbub is used in production), and this is not primarily about the type of traffic reduction you are thinking of. But yes, the resources needed to do without WebSub what can be done with WebSub at a big scale are enormous.
Again, this is about enabling push architectures for server to server communication and a scalable way to achieve real time notifications. The tiny feed reader application on your smartphone is not directly related to any of this. It would not use this protocol, it can't use this protocol, and it would not stop working because of this protocol.
Edit: I made this sound a bit nicer than initially. It really feels like you want to misunderstand this spec, this annoys me a bit.
The difference between a proposal like ActivityPub and something like WebSub is that ActivityPub solves an actual problem of social network monopoly.
The first products that came up with ActivityPub are free (Mastodon), while the products presented in this thread make money from the protocol. In my opinion, that's sketchy.
WebSub is a spec from the very same working group that published the ActivityPub spec, and WebSub is actively used by e.g. the IndieWeb movement, which does indeed very much address social network monopoly, as everyone there hosts their social profiles themselves.
I'm subscribed to about 40 RSS feeds on my mobile phone, and I don't even have the time to read everything.
> It's way more efficient at scale (I.e. when feeds are actual people, like in social networks)
Could you show me the basic math to prove that? Firms such as Facebook and Reuters already operate at world scale and don't need a new protocol to deliver news.
RSS with a simple pull GET request is actually more OPEN than this proposal.
> RSS with a simple pull GET request is actually more OPEN than this proposal.
It is not. Also, Websub does not change anything for clients just directly fetching the RSS feed.
> It is not. Also, Websub does not change anything for clients just directly fetching the RSS feed
Indeed, it simply adds a middleman between the content producer and the final consumer.
No, it does not, not for the clients you seem to think about. The regular RSS feed does not vanish.
For platforms like facebook using schemes like this, see https://www.quora.com/Why-doesnt-Facebook-implement-PubSubHu... and https://developers.facebook.com/docs/graph-api/webhooks
That's fine, you clarified that it would be used between servers. But it limits the "openness" of the concept imho.
2. We have HEAD, can we do service discovery using HEAD?
3. Why not let a topic be an HTTP URL? “PUB /user/john/position HTTP/1.1\r\ndata...”.
4. Subscription expiration as a way to force subscribers to renew and upon renew get redirected to other servers is pretty cool. NATS has a special message (the INFO message) to do the same, but you might be in the middle of an important request-reply session you don’t want to abort.
5. The authors could have made this protocol very “non-http-ish” by implementing what amounts to Redis but in HTTP. I’m glad they didn’t. This still feels like HTTP, which is great.
> Topic. An HTTP (or HTTPS) resource URL.
Service discovery does appear to support HEAD requests. (See section 4.)
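For instance, discovery boils down to pulling the rel="hub" and rel="self" URLs out of the Link header of a HEAD (or GET) response on the topic URL. A rough sketch (the helper name is hypothetical, and a production client should use a full Link-header parser):

```python
import re

def find_hub_and_topic(link_header):
    # Extract the rel="hub" and rel="self" URLs from a Link header value.
    # This regex only covers the common comma-separated single-header form.
    hub = topic = None
    for url, rels in re.findall(r'<([^>]+)>\s*;\s*rel="([^"]+)"', link_header):
        if "hub" in rels.split():
            hub = url
        if "self" in rels.split():
            topic = url
    return hub, topic
```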
Having a new HTTP verb for subscribing and publishing would seem like unnecessary complexity to me. Rather than ask "why not a new verb", I think a case would need to be made that a new verb is required, that the operation does not cleanly fit into the semantics of existing verbs. The existing verbs are capable of modeling quite a lot.
With the protocol as they've described it, subscribing is just sending an HTTP POST to the hub URL, passing in the topic URL. That's a simple HTTP operation that a lot of clients and programs can be instructed to do easily. Requiring the use of a new HTTP verb will make interoperability difficult without apparent benefit.
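Concretely, the whole subscription request is one form-encoded POST of hub.* parameters. A sketch (the function name and lease value are illustrative):

```python
import urllib.parse
import urllib.request

def build_subscribe_request(hub_url, topic_url, callback_url,
                            lease_seconds=864000):
    # A WebSub subscription is a form-encoded POST of hub.* fields to the hub.
    body = urllib.parse.urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
        "hub.lease_seconds": str(lease_seconds),
    }).encode()
    return urllib.request.Request(
        hub_url, data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST")
```

Sending that with `urllib.request.urlopen` (or `curl -d`) is all a subscriber has to do; no new verbs or client capabilities required.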
Complexity for whom?
Introducing a secondary "hub" resource here is just accidental complexity. If I want to subscribe to resource A why am I talking to a different resource B? And once you introduce a secondary resource now you need yet another service discovery mechanism to support discovery of these pseudo-resource hubs. (Heaven forbid using an existing service discovery mechanism like RDDL.)
Honestly, stuff like this is just so poorly thought out that it's difficult to understand why the W3C stamps approval on this crap. There's no consideration given to alternate protocols like WebSockets or XMPP, and there's no attempt to layer on top of existing standards in a meaningful way (hub.secret -- really???). Worst of all, there's no real understanding here of what it means for a resource to change. The entire content-distribution model is geared towards just one very narrow use case.
It's clear the W3C is all about being "inclusive" and "moving fast", and there's real fear of "overthinking" things -- but seriously, if this is the result, we'd probably be better off with better standards once a decade than this.
Think about what the hub has to do. It may have to notify millions of subscribers, deal with any errors, retry, etc. This is a very heavy duty messaging system that most publishers will not want to run themselves. And yet you want the publisher's domain name to be the well known resource that ultimately controls things.
Publishers may be blogs hosted on small websites or even things like cars, phones, laptops or home appliances that are not always online or have to work under tight resource constraints.
Publishers may wish to distribute their content through more than one hub. We don't even have to think of avoiding censorship to see why this increases availability.
I think making it possible to split the roles of publisher and distributor is a very good idea. You can still decide to implement both roles on one server.
Funnily enough, the first drafts of this protocol (back then, called PubSubHubbub) were written circa 2008, so this specification is about a decade in the making.
At the time it was distributing content between a number of the bigger blogging/publishing platforms of the day, and also notifying search engines so they could update their indexes more quickly.
If anything it seems like the standardization process was too long and missed the boat here (this particular problem is now most often solved by proprietary protocols), rather than being "rushed through".
Can't deny that the world has changed a lot during the lifespan of this idea, though. Cellular-connected computers in our pockets were barely on the radar when this spec was first written. I'm sure some would argue that the burdens of publishing have now shifted on to the reader (probably battery powered, spotty connectivity) whereas in this spec's original universe the burdens were on the publisher (CDNs not yet as widespread, more independent publishing from web hotels, etc).
WebSub is a protocol for people who want to implement the publish/subscribe pattern over HTTP callbacks (aka webhooks). Using webhooks means that subscribers don't need to have any kind of ongoing connection or session open to receive publishes. Subscribers are passive web servers and merely wait to receive an HTTP POST. No state, no connection, polling, or anything. The general model of HTTP callbacks is a simple scheme that's easy to implement using any programming language or platform out there, all of which have HTTP clients and servers capable of getting the job done with minimal fuss.
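As a sketch of how passive a subscriber can be, here's a callback endpoint in a few lines of stdlib Python (the handler shape is an assumption; a real endpoint would also validate the topic and the signature):

```python
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs, urlparse

class CallbackHandler(BaseHTTPRequestHandler):
    """Hypothetical WebSub callback endpoint: GET = subscription
    verification, POST = content delivery."""

    def do_GET(self):
        # The hub verifies intent by asking us to echo back hub.challenge.
        params = parse_qs(urlparse(self.path).query)
        challenge = params.get("hub.challenge", [""])[0]
        self.send_response(200)
        self.end_headers()
        self.wfile.write(challenge.encode())

    def do_POST(self):
        # The hub pushes the updated feed content; we just acknowledge it.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        self.send_response(200)
        self.end_headers()
        # hand `body` off to the application here

    def log_message(self, *args):  # keep the sketch quiet
        pass
```

Between the verification GET and the next delivery POST there is nothing to keep alive: no socket, no session, no state beyond the subscription record.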
I have actually built custom systems that worked using a very similar pattern as this protocol, where clients of a service pass in URLs where they'd like to be notified when an event occurs. Perhaps this is why I find myself nodding along when I read the protocol spec. There wasn't any standard way to model this, so I just invented something on the fly. You also see this pattern implemented in services like AWS SNS's support for HTTP, in Google Cloud PubSub, Twilio, etc. Each of these has an entirely custom protocol for PubSub over HTTP callback, and not something that's standard. They all tackle similar issues like preventing attackers from creating unauthorized subscriptions to URLs, but in different ways.
WebSockets doesn't solve the same problem as WebSub. WebSockets require a continuous connection from a client to a service. An application will need to devise its own logic for resuming a session if the connection breaks.
WebSub requires no active connection nor session. WebSub subscriptions could remain functional for months at a time (really indefinitely), with there being no communications whatsoever between messages. The system that initiates the subscription can be different than the one that receives the publishes, which is valuable because it means that messages don't all have to go to a single place. Publishes are sent to the domain name specified in the subscription URL. The subscribed web server could change regularly and everything will work as long as the DNS name keeps pointing to the right place. You could use multiple web servers to handle the subscription, by putting multiple servers in the DNS record, or you could use a load balancer in the same way as other web requests. This means you can scale easily. These are the kinds of benefits you get from building subscriptions on top of HTTP. All of the standard techniques and standard software "just work".
I'm not an expert on XMPP, but I suspect XMPP would also be a bad fit for this use-case, and would also require continuous connectivity from the subscriber (please correct me if I'm wrong.) I think the same is true for MQTT but I'm not an expert on that either.
As a person who has built and used multiple systems following this general abstract pattern, I think this is a good attempt at drafting a standard protocol. My impression reading the spec is that its designers had a good idea what problem they wanted to solve, and what kind of characteristics they wanted the solution to have, and came up with a protocol that succeeded in meeting those requirements.
What's the objection to hub.secret? That facility doesn't seem essential to a minimal version of a protocol like this, but I understand why they included it. It provides a simple way for the subscriber to authenticate that the content they're receiving is legitimately the result of their subscription to the topic, and not e.g. an attacker's subscription, or an attacker system that's trying to impersonate the hub. How would you tackle this issue in a simpler way? (It would not be easy to solve this problem within the protocol using TLS, for example.)
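For reference, checking hub.secret on the subscriber side is only a few lines: the hub sends an `X-Hub-Signature` header like `sha256=<hexdigest>`, where the digest is an HMAC of the raw request body keyed with the secret. A sketch:

```python
import hashlib
import hmac

def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    # Header format: "<algo>=<hexdigest>", where the digest is an HMAC of
    # the raw request body keyed with the subscriber's hub.secret.
    algo, _, received = signature_header.partition("=")
    if algo not in ("sha1", "sha256", "sha384", "sha512"):
        return False
    expected = hmac.new(secret, body, getattr(hashlib, algo)).hexdigest()
    return hmac.compare_digest(expected, received)
```

`hmac.compare_digest` avoids leaking the digest through timing differences, which is why it's preferred over a plain `==` here.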
I still don't see the added value compared to a simple pull-news-from-URL model. RSS with GET is already session-less, remains functional for months, AND also works when clients cannot receive incoming connections (mobile devices).
One of the benefits closed platforms have is that they can deliver posts inside the platform immediately. WebSub brings that option to feeds on the open web, without requiring subscribers to poll every <5 minutes and without requiring large changes under the hood, e.g. introducing new non-HTTP protocols which can't be used with all hosting options.
For end devices, other update mechanisms are useful, as you say, and systems speaking them could hook onto WebSub hubs to get notifications they then translate. E.g. your typical WordPress blog has no chance of offering an XMPP channel, but it can ping a WebSub hub since it's only HTTP.
Pubsubhubbub was a (relative) success because everyone doing RSS feeds could easily add it with their existing tech stack.
Anyway, if the proposal is useful to some people, then it won't do harm to have it in the public domain.
But I love the idea of topics being URIs (not just HTTP URLs).
WebSub solves that by designating a hub that can handle this instead. I.e., you federate your blog feed elsewhere.
ActivityPub fills another niche, mostly everything around human interaction in social networks.
WebSub could be used to feed data into ActivityPub networks.
Does this mean the subscriber needs to have a forwarded port open to the internet for this to work? Without IPv6, users behind NAT (and specifically behind CGNAT) wouldn't be able to use it.