JSON Feed (jsonfeed.org)



> JSON Feed files must be served using the same MIME type — application/json — that’s used whenever JSON is served.

So then it's JSON, and I'll treat it as any other JSON: a document that is either an object or an array, that can include other objects or arrays, as well as numbers and strings. Property names don't matter, nor does the order of properties or array items, or whatever values are contained therein.

Please don't try to overload media types like this. Atom isn't served as `application/xml` precisely because it isn't XML; it's served as `application/atom+xml`. For a media type that is JSON-like but isn't JSON, you may wish to look at `application/hal+json`; incidentally there's also `application/hal+xml` for the XML variant.

Or as someone else rightly suggested, consider just using JSON-LD.


It's worth pointing out that any valid JSON value is a valid JSON document. There is no requirement or guarantee that an array or an object are the top-level value in a JSON document.

"I am a valid JSON document. So is the Number below, and in fact every line below this line."

4

null


Actually actually... the JSON spec doesn't define the concept of a JSON document. Neither http://www.json.org/ nor http://www.ecma-international.org/publications/files/ECMA-ST... actually specifies that a JSON 'document' is synonymous with a JSON 'value'.

Now it's also true that JSON doesn't specify an entity that can be either an object or an array but not be a string or a bool or a number or null. So it's kind of true that JSON doesn't say that an object or array are valid root elements.

But JSON also says "JSON is built on two structures" - arrays and objects. It defines those two structures in terms of 'JSON values'. Still, a reasonable way to read the JSON spec is that it defines a concept of a 'JSON structure' as an array or object - but not a plain value - and then to assume that a .json file contains a JSON 'structure'.

Basically... JSON's just not as well defined a standard as you might hope.

edit: And now I'm going to well actually myself: Turns out https://tools.ietf.org/html/rfc4627 defines a thing called a 'JSON text' which is an array or an object, and says that a 'JSON text' is what a JSON parser should be expected to parse.

So - pick a standard.


JSON is in fact defined in (at least) six different places, as described in the piece 'Parsing JSON is a Minefield' [1] (HN: [2]).

The problem is perhaps not as egregious as with "CSV" -- which is more of a "technique" than a format, and for which, despite 30 years of customary usage, someone only retroactively wrote a spec; but it does manifest in various edge cases like the ones we're debating.

[1] http://seriot.ch/parsing_json.php [2] https://news.ycombinator.com/item?id=12796556


Why are you referencing the obsolete RFC? There is no restriction to object/array for a JSON text in the current RFC: https://tools.ietf.org/html/rfc7159


The current RFC recommends the use of an object or array for interoperability with the previous specification. JSON being a bit of a clusterf* of variants, they tried to make the RFC broad and then place interoperability limitations on it (be lenient in what you accept, etc. etc.).


Because I just discovered that there was, at least once, a specification that actually defined JSON that way, where previously I had thought it had only been ambiguously described, and I thought that was interesting.


> There is no requirement or guarantee that an array or an object are the top-level value in a JSON document.

Alas, if only that were true.

RFC 4627:

> A JSON text is a serialized object or array. The MIME media type for JSON text is application/json.

RFC 7159:

> A JSON text is a serialized value. Note that certain previous specifications of JSON constrained a JSON text to be an object or an array. Implementations that generate only objects or arrays where a JSON text is called for will be interoperable in the sense that all implementations will accept these as conforming JSON texts.

IIRC, Ruby's JSON parser was written to be strictly RFC 4627 compliant, and yields a parser error for non-array non-object texts.

Since JSON isn't versioned, no one has any idea what "JSON" really means, or which "standard" is being followed.
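
A quick illustration of the practical difference, sketched with Python's json module (which follows the looser RFC 7159 reading; the stricter behaviour is only noted in comments):

    import json

    json.loads('{"a": 1}')   # a top-level object: every parser accepts this
    json.loads('[1, 2, 3]')  # a top-level array: every parser accepts this
    json.loads('4')          # fine per RFC 7159 (returns 4); strict RFC 4627 parsers reject it
    json.loads('null')       # likewise fine here (returns None), rejected by stricter parsers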


You're right, thanks for the correction! Also kind of reinforces my point I feel. That any JSON document is just that, a JSON document; it doesn't carry more semantics just because you say so. My JSON parser will still just see simple JSON values, no matter how much I tell it that a certain key should really be a URL, not just a string.


True, but that's also true of any XML, RSS, Atom, HTML, etc. Websites abuse HTML all the time, and there's nothing saying that just because something is transferred with application/atom+xml that it will be valid or follow the spec.

It's more of a social agreement. If you get a JSON object from a place you expect a JSON Feed and it has a title and items, then it'll probably work, even if it omits other things.


So we can ditch media types altogether then? What's the point of having actual contracts if all we need is a handshake and a wink? We're not talking about malformed data here, that's something different entirely and yes – it happens all the time. We're talking about calling a spade a spade.

If it's JSON your program expects then I should be able to throw any valid JSON at your program and it should work. Granted, it probably won't be a very interesting program precisely because JSON is just generic data without any meaningful semantics.

This spec is entirely about attaching semantics to JSON documents, but all that gets lost when you forget to let people know the document carries semantics and just call it generic JSON. Maybe that doesn't matter to a JSON Feed-specific app that thinks any JSON is JSON Feed (an equally egregious error), but if there's an expectation that I should be able to point my catch-all program (i.e. a web browser) at a URL and have it magically (more like heuristically I guess, potato/tomato) determine that the document retrieved isn't in fact just any JSON, then things are about to get real muddy. Web browsers aren't particularly social, so I suspect a social agreement probably won't work that well.

Media types aren't just something that someone thought was a nifty idea back in the dizzy, they are pretty important to how the web functions.


> If it's JSON your program expects then I should be able to throw any valid JSON at your program and it should work.

That's not a valid argument, because JSON is just a serialization format for an arbitrary data structure. You can't throw any arbitrary data structure at any program that accepts data and expect it to handle it. Every program that accepts input requires that input to be in a specific format, which is nearly always more specific than the general syntax of the format. And aside from programs that make strict use of XML schemas, they pretty much all use the handshake-and-wink method for enforcing the contract. (Or to put it another way: documentation and input validation.)

My take on the author's approach is that the content-type is specifying the syntax of the expected input, and the documentation specifies the semantics and details of the data structure. In that respect, the program works like most other programs out there.


Aww that's not fair – if you're going to quote then don't cherry pick and remove the relevant bits.

> If it's JSON your program expects then I should be able to throw any valid JSON at your program and it should work. Granted, it probably won't be a very interesting program precisely because JSON is just generic data without any meaningful semantics.

(Emphasis mine.)

By doing this you're just reinforcing my argument that just parsing any ol' plain JSON won't make for very interesting programs. JSON is just plain dumb data, it doesn't tell you anything interesting. There may be additional semantics you can glean from a document beyond just its format (HTML is pretty good for this, but oddly enough not a very popular API format) if there are mechanisms to describe data in richer terms – but JSON has none of these. Yet this spec says you should serve this as just plain ol' boring JSON.

> And aside from programs that make strict use of XML schemas, they pretty much all use the handshake-and-wink method for enforcing the contract.

This is just not true. Case in point: web browsers – arguably one of the most successful kinds of program there ever were, with daily active users measuring in the billions – make heavy use of metadata, including media types, to determine how to interpret their input. Not just by way of format (i.e. media type) but also by way of supplemental semantics (e.g. markup, microformats, links).

> My take on the author's approach is that the content-type is specifying the syntax of the expected input, and the documentation specifies the semantics and details of the data structure.

Which could and should be described in a spec, with a corresponding IANA consideration to include a new and specific media type in the appropriate registry – not by overloading an existing one.


I'm not sure what you're arguing. JSONFeed is JSON, unless I'm missing something, just JSON that matches a specific schema.

If I'm pulling JSON from any API, I expect it to match a certain schema. If I expected { "time": 10121 } from a web API and they send me "4", then sure, that's valid JSON, but it doesn't match the schema they promised me in the API.

Something that's JSON should be marked JSON, even if we're expecting it to follow a schema.


> JSONFeed is JSON, unless I'm missing something, just JSON that matches a specific schema.

Yes, and everything is application/octet-stream, so why have mime types? Because it helps with tooling, discovery, and content negotiation. It is a hint for the poor soul who inherits some undocumented ruby soup calling your endpoint.

Being as specific as possible with MIME types is a convention for a reason. Please don't break it unless you have an explicit reason to.


This is exactly one of the things that media-types solve. Simply using application/json doesn't tell me (consumer) anything about the semantics of what I'm reading. It only tells me what "parser" to use. If we have a proper media-type, like application/hal+json, I know exactly how to create a client for that type: I need to use a JSON parser _and_ use the vocabulary defined by HAL…


> Something that's JSON should be marked JSON, even if we're expecting it to follow a schema.

That's what the +json type suffix is for. I wonder how many people in this thread actually have read the mediatype RFCs, because they definitely don't encourage using mediatypes in the way you're describing.

The whole point of mediatypes is to make it possible to distinguish schemas while also potentially describing the format that the schema is in.


This tool may help to validate and format JSON data: https://jsonformatter.org


Beware that many JSON parsers don't agree with this, although your interpretation is the correct interpretation of the spec. Some parsers will only accept either an array or object. If you're building a JSON endpoint you'll be safest returning either an array or object.


true


false


Absolutely agree about the MIME type.

Someone filed an issue and created a pull-request for this after you wrote this comment.

https://github.com/brentsimmons/JSONFeed/issues/22

https://github.com/brentsimmons/JSONFeed/pull/23

I hope they will merge it.


That's great, thanks for sharing!


Should this web page have been served as `text/hacker-news-comment-thread+html`?


No. HTML is not formally recognized as a 'Structured Syntax' upon which semantically richer standalone mediatypes can be built [1]. This is because existing deployments favor a different approach of imbuing additional semantics inside HTML documents -- microformats -- which place the mechanism of understanding on an opportunistic parser, vs. a restrained one that only executes if its preferred mediatype is advertised. Appendix A of RFC 3023 [2] offers a thorough treatment of this matter. Not defining +html is essentially a concession that enables the two schools of thought to coexist side-by-side.

This is the same difference in schools that I express in a different comment [3] in this thread.

[1] https://tools.ietf.org/html/rfc6839 [2] https://tools.ietf.org/html/rfc3023#appendix-A [3] https://news.ycombinator.com/item?id=14361842


No, because this web page isn't using a specialized format.

But if it were using XHTML, then the proper mime type would be application/xhtml+xml.


It seems that all this spec is, is a structure for an API response. I don't see why it should have a different media type.


I don't believe it should be just application/json because it's a specific format of json. There could be multiple json representations of the feed other than jsonfeed that the server supports and the client could define which ones they Accept.

So the server could support all of the following:

application/jsonfeed

application/rss+xml

application/atom+xml

Who knows, maybe RSS and ATOM could be represented in JSON and have the following mime types:

application/rss+json

application/atom+json

If it's just an API response, and it is your API for an application called Widget Factory, then you can, if you want, have your own format:

application/vnd.widgetfactory+json

Generally, such a mime type should have some specification describing it, otherwise nobody can reliably implement a compatible client. JSON Feed has proposed that specification.
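
A rough sketch of what that negotiation could look like from the client side, using Python's requests library and the hypothetical application/vnd.widgetfactory+json type from above (the endpoint URL is made up too):

    import requests

    # Prefer the vendor-specific JSON representation, fall back to plain JSON.
    resp = requests.get(
        "https://example.org/feed",  # hypothetical endpoint
        headers={"Accept": "application/vnd.widgetfactory+json, application/json;q=0.5"},
    )

    content_type = resp.headers.get("Content-Type", "")
    if content_type.startswith("application/vnd.widgetfactory+json"):
        feed = resp.json()  # we know exactly which schema to expect
    else:
        feed = resp.json()  # plain JSON: the schema has to be guessed or read out of band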


Well, if JSON had namespaces or a standard validation framework, we could have that conversation.


You mean, like JSON schema?

http://json-schema.org/

Not sure why you want to emulate XML namespaces in JSON, but JSON schemas can include other JSON schemas and extend upon other JSON schemas. That accounts for 99.9% of the use cases for namespaces.
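
A minimal sketch of that kind of composition, using the Python jsonschema package and a draft-07 style schema (the names are made up):

    import jsonschema

    schema = {
        "definitions": {
            "base_item": {
                "type": "object",
                "required": ["id"],
                "properties": {"id": {"type": "string"}},
            }
        },
        # "extend" the base definition with an extra required field
        "allOf": [
            {"$ref": "#/definitions/base_item"},
            {"required": ["url"], "properties": {"url": {"type": "string"}}},
        ],
    }

    jsonschema.validate({"id": "1", "url": "https://example.org/"}, schema)  # passes
    jsonschema.validate({"id": "1"}, schema)  # raises ValidationError: "url" is required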


That's my point though – it doesn't have anything to describe metadata, so therefore trying to cram in additional semantics is futile if you still want to call it JSON. Call it something else and you can attach whatever semantics you'd like, but they think it should be served up as `application/json`, which means all those semantics go out the window.


Do we really need this? Atom is fine for feeds. Avoiding XML just for the sake of avoiding XML, because it isn't "cool" anymore, is just dumb groupthink.

If this industry has a problem, it's FDD - Fad Driven Development and IIICIS (If It Isn't Cool, It Sucks) thinking.


Part of me is with you. But even in established languages I've had trouble finding an appropriate xml parser and had to tweak them way more than I thought necessary. I haven't (yet) had that problem with JSON.

I think with something like feeds there's the possible benefit of becoming a 'hello world' for frameworks. Many frameworks have you write a simple blogging engine or twitter copycat. I don't think I've ever seen that for a feed reader/publisher. People have said that Twitter clients were an interesting playground for new UI concepts and paradigms because the basics were so simple (back when their API keys were less restrictive). Maybe this could be that?


> But even in established languages I've had trouble finding an appropriate xml parser and had to tweak them way more than I thought necessary. I haven't (yet) had that problem with JSON.

Maybe it's just that I work mostly with JVM languages (Java, Groovy, etc.) but I haven't had any problems with handling XML - including Atom - in years. But I admit that other platforms might not have the same degree of support.


Most of my experience is from Python. Each time I use it I have to look at the docs for etree (a library that ships with Python). We would hit performance and feature support issues with etree and tried lxml but had binary compatibility issues between our environments.

The Hitchhiker's Guide to Python[1] (a popular reference for Python) recommends untangle[2] and xmltodict[3], neither of which I've used.

I feel like other languages I've used had similar brittleness when dealing with XML. I might be biased because, working with XML in an editor, it's difficult to validate visually or grok in general when used in practice.

[1] http://python-guide-pt-br.readthedocs.io/en/latest/scenarios...

[2] https://untangle.readthedocs.io/en/latest/

[3] https://github.com/martinblech/xmltodict


Beautiful Soup is alright in most cases. JSON is handled much better than XML is by any library I've seen so far, though.


Oh yes, I've used Beautiful Soup, too. If I remember correctly I had great luck with html, but issues with xml. It also is only a reader, not a writer.


> Maybe it's just that I work mostly with JVM languages (Java, Groovy, etc.) but I haven't had any problems with handling XML

Yeah, no surprise. XML may as well be a native data-type in most core JVM languages.

It's not the case everywhere else however.


What language are you using that doesn't have a working XML parser? REALLY?


He said appropriate XML parser.

All languages have XML parsers; it's more that a lot of them suck. They might have weird concepts you have to use, or constantly trip you up with namespaces, or make it really hard to write XPath queries.


> or are constantly tripping you up with namespaces

You mean requires that you understand the XML format you are working with? Oh noes!

Namespaces exist, just about everywhere in the world of programming, and they do so for a reason.

<bar /> is not the same as <foo:bar /> just like http://bar.com is not the same as http://bar.foo.com.

If that's putting the bar high, I really think I may be suffering a huge disconnect from the rest of my peers in terms of expected capabilities.

Just because JSON doesn't have namespacing-capabilities at all, doesn't make it a worthless feature. It's actually what gives you the eXtensibility in XML. As a developer I expect you to understand that.

(And I wonder how long it will take before the JS-world re-implements this XML-wheel, while again doing so with a worse implementation)


The reason why many developers hate XML namespaces isn't the concept but the implementations which force you to repeat yourself everywhere. I think a significant amount of the grumbling would go away if XPath parsers were smart enough to assume that //tag was the same as //default-and-only-namespace:tag, or at least allowed you to use //name:tag instead of //{URI}tag because then you could write against the document as it exists rather than mentally having to translate names everywhere.

Yes, you can write code to add default namespaces when the document author didn't include them and pass in namespace maps everywhere but that's a lot of tedious boilerplate which requires regular updating as URLs change. Over time, people sour on that.

It really makes me wonder what it'd be like now if anyone had made an effort to invest in making the common XML tools more usable, and in other maintenance, so that e.g. you could actually rely on using XPath 2+.
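
A small sketch of the repetition being complained about, using Python's xml.etree.ElementTree on an Atom-flavoured snippet:

    import xml.etree.ElementTree as ET

    doc = ET.fromstring(
        '<feed xmlns="http://www.w3.org/2005/Atom"><title>Example</title></feed>'
    )

    doc.find('title')                               # None: the default namespace still applies
    doc.find('{http://www.w3.org/2005/Atom}title')  # works, but the URI gets repeated everywhere
    doc.find('a:title', {'a': 'http://www.w3.org/2005/Atom'})  # works, with a prefix map you must carry around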


> (And I wonder how long it will take before the JS-world re-implements this XML-wheel, while again doing so with a worse implementation)

I'm going to guess never. I'm also going to guess that there isn't a single flamewar in the entire history of JSON where someone was trying to figure out how to implement anything close to XML namespaces in JSON. And by "close", I mean something that would require changes to JSON parsers and/or downstream APIs to accommodate potentially bipartite keys.


You never know. This is what they said about schemas too not many years back.


Have there been any discussions whatsoever about adding some sort of namespacing mechanism to JSON?


Well, there's JSON-LD (JSON Linked Data) already.

It's for making interoperable APIs, so there is a good motivation for namespaces. But the namespaces are much less intrusive than XML namespaces. Ordinary API consumers don't even have to see them.

One of the key design goals of JSON-LD was that -- unlike its dismal ancestor, RDF/XML -- it should produce APIs that people actually want to use.


Thanks, I haven't explored JSON-LD before.

But that's not a case of adding namespaces to JSON, is it?

What I mean is if one were to take the skeptical position that JSON is going to end up "re-inventing the XML wheel", that would mean JSON advocates would need to push namespaces into the JSON spec as a core feature of the format. I've never read a discussion of such an idea, but I'd like to if they exist.

edit: clarification


Well, yeah, perhaps the craziest thing about XML is that it has namespaces built into its syntax with no realistic model of how or why you would be mashing up different sources of XML tags.

Namespaces are about the semantics of what strings refer to. They belong in a layer with semantics, like JSON-LD, not in the definition of the data transfer format.

I am convinced that nobody would try to add namespaces to JSON itself. Just about everyone can tell how bad an idea that would be.


> Well, yeah, perhaps the craziest thing about XML is that it has namespaces built into its syntax with no realistic model of how or why you would be mashing up different sources of XML tags.

The thing that gets me is that they were added to XML, so the downstream APIs then got mirrored interfaces like createElementNS and setAttributeNS that cause all sorts of subtle problems. With SVG, for example, this generates at least two possible (and common) silent errors-- 1) the author creates the SVG in the wrong namespace, and/or 2) more likely, the author mistakenly sets the attribute in the SVG namespace when it should be created in the default namespace. These errors are made worse by the fact that there is no way to fetch that long SVG namespace string from the DOM window (aside from injecting HTML and querying the result)-- judging from Stack Exchange, users are manually typing it (often with typos) into their programs and generating errors that way, too.

Worse, as someone on this site pointed out, multiple inline SVGs can still have attributes that easily suffer from namespace clashes in the <defs> section. It's almost comical-- the underlying format has a way to prevent nameclashes with multiple attributes inside a single tag that share the same name-- setAttributeNS-- but is no help at all in this area.

edit: typo and clarification


XML parsers have a pretty bad track record for security vulnerabilities. If I was writing code to distribute that was going to be parsing arbitrary data from third parties (which is the RSS/Atom use case), I would be more comfortable trusting the average JSON parser than the average XML parser.

Otherwise, I agree with the "if it ain't broke" principle. There's also cases where so much ad hoc complexity is built on top of JSON that you end up with the same problems XML has, except with less battle-tested implementations.


As terrible as XML parsers can be, they've never been as bad as "XMLdoc = eval(XMLString)". I'd be more likely to trust a JSON parser not written in JavaScript than an arbitrary XML parser, but that's only because of the XML specification itself, which includes such features as including arbitrary content as specified by URLs (including local (to the parser) files!). Great ideas when you can trust your XML document, not so great otherwise.


Modern browsers don't internally call eval(). See e.g. the definition of JSON.parse in V8: https://chromium.googlesource.com/v8/v8/+/4.3.65/src/json.js...


And modern XML parsers aren't full of vulnerabilities anymore. You're missing the point.


It is very likely that I am an idiot, but I've always found parsing XML too hard, especially compared to JSON, which is almost too easy.


Whether parsing XML is easy or hard, how often do you actually write an XML parser? If I'm digesting a JSON/XML document, I resort to a parser library for the language that I'm using at that point, so the complexity of writing such a parser is pretty much non-existent. Definitely not a compelling reason to switch to JSON.


Most XML parsers I've used are leaky abstractions. Even once the document is parsed, actually accessing the data can require a lot more complexity than accessing parsed JSON data.


IIRC, the popular C++ implementations were glorified tokenizers. It was up to you to figure out which tokens were data and how those tokens related to each other.


Ah, SAX. People built some true horrors with that API, just because it was "more performant" than DOM. Never mind that their hacked-together tree builders often leaked like sieves.


If there was an `XML.parse` just like there's `JSON.parse`, I doubt you'd say the same. As it stands, the added complexity in JS-land is to import a library that provides this functionality for you. Fortunately there are many, but I agree a built-in would be nice. It's a bit of a shame that E4X never landed in JS.


It's more than JUST library support. It's also that JSON deserializes into common native data types naturally (dictionary, list, string, number, null).

You can deserialize XML into the same data types, but it's not anywhere near as clean because of how extensible XML is. That's a big part of what's made JSON successful.


Right, but you inevitably end up with boilerplate "massage" code around your data anyway. Case in point: dates, any number that isn't just a number (e.g. currencies or big numbers), URLs, file paths, hex, hashes. Basically any type that carries any kind of semantics beyond array, object, string, number, or null will require this boilerplate; the data format just has no way of describing them except for out-of-band specs, if you want to call them such.
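
A hedged example of that massage code, assuming a feed item carrying an ISO 8601 date_published string (the field name is borrowed from the JSON Feed examples):

    import json
    from datetime import datetime

    item = json.loads('{"id": "1", "date_published": "2017-05-17T08:30:00Z"}')

    # JSON only hands us a string; the knowledge that this is really a datetime
    # lives out of band, so every consumer repeats a conversion like this one.
    published = datetime.fromisoformat(item["date_published"].replace("Z", "+00:00"))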

At least XML has schemas, and even if all you're doing is deserializing everything into JsonML like objects you're still better off because you'll have in-band metadata to point you in the right direction.


CBOR [1] allows the semantic tagging of data values and makes a distinction between binary blobs (a collection of 8-bit values) and text (which is defined as UTF-8).

[1] RFC 7049. Also check out http://cbor.io/
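
A tiny sketch with the third-party cbor2 package (assuming it's installed); tag 0 is the standard RFC 7049 tag for a date/time string:

    import cbor2

    payload = cbor2.dumps({
        "title": "hello",                 # UTF-8 text
        "thumb": b"\x89PNG...",           # raw bytes, kept distinct from text
        "published": cbor2.CBORTag(0, "2017-05-17T08:30:00Z"),  # semantically tagged value
    })

    item = cbor2.loads(payload)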


IMHO the boilerplate code is much easier to read than understanding the nuances of XML if I have to read a document.

{"type":"currency", "unit":"euro", "amount": 10}

feels easier to understand than

<currency unit="euro">10</currency>


Maybe it's just conditioning, but I find the latter example easier to read and understand. In fact, I'd say that - in general - I find XML better in terms of human readability than JSON. I guess it just goes to show that we all see certain things differently. shrug


I think that's totally reasonable - because it was after all one of the goals of XML. That is, to be human readable.

There is a difference however between readable + parsable vs parsable + easily dealt with.

XML was not the latter. You have to do more work to traverse and handle XML inside your application than you do JSON, and most of the (reasonable) reasons for this are due to features that most cases don't need.

JSON makes the common case easy, XML doesn't.


How about:

<rec type="currency" unit="euro" amount="10" />

I don't think your problem is with the syntax, necessarily. It seems more like you prefer name/value pairs over semantic markup.


The biggest problem with XML is how easy it is to make a very bad schema, and how hard those can be to parse.


Also, for what it's worth, your point is exactly why I mentioned E4X. Sure wasn't a panacea, but it had some things going for it.


E4X was exceedingly great. Loved it.


This is really only true in dynamically typed languages. From personal experience: parsing json in Java or Go without just treating everything as a bag of Object or an interface{} requires a ton of menial boilerplate work.

Super nice in python/ruby/javascript, though.


Swift 4 will have JSON encoding/decoding built in, and I wouldn't be surprised to see such a feature spring up in other modern languages too. Once that boilerplate is eliminated, json is a pretty decent solution.

https://www.hackingwithswift.com/swift4


I am very stoked for this.


> parsing json in Java or Go without just treating everything as a bag of Object or an interface{} requires a ton of menial boilerplate work.

From my experience in Java it is pretty simple using a library like Jackson. You define the types you expect to read, maybe add some annotations, and then it's one function call to deserialize. IIRC Go has something similar in its json library.


Yes, it's arguably nicer in Go, because you specify exactly what types and field names you expect, and then it's just a simple json.Unmarshal


Sure--it's kind of a pain in Swift, too.

Wouldn't it just be worse with XML, though? I get that people don't realistically parse it themselves and libraries are smart enough to use schemas to deserialize, but there's nothing inherent about JSON that makes it unable to conform to a schema or be parsable by libraries into native strongly-typed objects the same way.


Except JSON doesn't have semantics to describe schemas, only arrays, objects, strings, numbers and null. You can say "but this key is special" but then it's not JSON anymore. And if you're ok with that, may as well just use JSON-LD or some other JSON-like format.


Idk, I think JSON parsing is pretty ergonomic in Rust, definitely nicer than your typical XML DOM.


Of course, JSON doesn't support complex numbers, bignums, rationals, cryptographic keys &c. And it'd be even worse than XML to try to represent programs in.

JSON is definitely easier to decompose into a simple-to-manipulate bag-of-values than is XML.


XML is fundamentally much more complex than JSON so any XML parsing library will inevitably present more complicated API. I kinda like XML (!), but there is no point pretending that using it is as simple as JSON.


I think that depends what you mean by "using it".

XML can convey a lot more semantic meaning than JSON ever will, and standardisation of things like XPath, DOM, XSLT, etc provides a lot of power when working with XML documents.

With JSON, essentially everything is unknown. You can't just get all child nodes of an object, or get all objects of a certain type, using standard methods. You need to know what object key 'child' nodes are referenced by, or loop through them all and hope that what you find is actually an array of child nodes, and not e.g. an array of property values. Finding all objects of a given type means knowing how the type is defined, AND the aforementioned "how do i get child nodes" to allow you to traverse the document.

Of course that assumes what you have is a document, and not just a string encoded as JSON. Or a bool/null.

My point is, the tooling around XML is very mature. "Use" of a data format is a very broad topic, and covers a lot more than just "i want to get this single property value".


Absolutely. Right tool for the right job. Mixed content (perhaps a paragraph with bold and italics) is absolutely horrible in JSON because it lacks the complexity that XML has to cope with this.


You're basically saying that this isn't technically better, just more socially acceptable right now. I think you're right, but it seems to me that Atom's problem is primarily a social one. So even if this doesn't carry any technical advantages, a format with a strong social "in" is precisely what we need to make feeds a thing again.


To be honest, I'm really excited about the prospect of JSON based feeds. Right now, there's no easy way to work with Atom/RSS feeds on the command-line (that I know of anyway), which is something I often wish I could do. With a JSON feed, I can just throw the data at jq (https://stedolan.github.io/jq/) and have a bash script hacked together in 10 minutes to do whatever I want with the feed.


I give you libxml:

    xmllint --xpath '//element/@attribute'
There's a good chance it's already installed on your Mac.


To avoid the hassle of handling xml namespaces (e.g. in an Atom feed...), just do:

    xmllint --xpath '//*[local-name()="element"]/@attribute'
Note: for consistency, namespaces are not needed for attribute names.

http://stackoverflow.com/questions/4402310/how-to-ignore-nam...


There are a few nice XML processing utilities. I tend to use xmlstarlet and/or xidel. This lets me use XPath, jQuery-style selectors, etc.

I agree that jq is really nice though. In particular, I still find JSON nicer than XML in the small-scale (e.g. scripts for transforming ATOM feeds) because:

- No DTDs means no unexpected network access or I/O failures during parsing

- No namespaces means names are WYSIWYG (no implicit prefixes which may/may not be needed, depending on the document)

- All text is in strings, rather than 'in between' elements

- No redundant element/attribute distinction

Even with tooling, these annoyances with XML leak through. As an example, xmlstarlet can find the authors in an ATOM file using an XPath query like '//author'; except if the document contains a default namespace, in which case it'll return no results since that XPath isn't namespaced.

This sort of silently-failing, document-dependent behaviour is really frustrating; requiring two branches (one for documents with a default-namespace, one for documents without) and text-based bash hackery to look for and dig out any default namespace prior to calling xmlstarlet :(

http://xmlstar.sourceforge.net

http://www.videlibri.de/xidel.html


I have an RSS client written in Rust that builds as a command line program.[1] I wrote this in 2015, and it needs to be modernized and made into a library crate, but it will build and run with the current Rust environment. It's not that hard to parse XML in Rust. Most of the code volume is error handling.

[1] https://github.com/John-Nagle/rust-rssclient


Surely there's an xml->json converter somewhere.


It's kind of tough to convert XML directly to other formats (including, but not limited to, JSON), because there are a lot of XML features that don't map cleanly onto JSON, such as:

• Text nodes (especially whitespace text nodes)

• Comments

• Attributes vs. child nodes

• Ordering of child nodes


As it happens, XSLT 3.0 and XPath 3.0 both have well documented and stable features for doing exactly this. Roundtripping XML to JSON and back is a solved problem - check it out some time; it may surprise you.


Are you talking about json-to-xml and xml-to-json?

From the XSLT spec [0]:

"Converts an XML tree, whose format corresponds to the XML representation of JSON defined in this specification, into a string conforming to the JSON grammar"

It can't take an arbitrary XML document and turn it into JSON, it can only take XML documents that conform to a specific format.

You can safely round-trip from JSON to XML and back to JSON. That's trivial because JSON's feature set is a subset of XML's.

What you can't safely do is round-trip from arbitrary XML to JSON and back to XML. That's because, as the parent said, there are features in XML that don't exist in JSON. That means you are forced to find a way to encode it using the features you do have, but then you can't tell your encoding apart from valid values.

[0] https://www.w3.org/TR/xslt-30/#func-xml-to-json


You could conceivably serialize the DOM as a JSON object, but the representation would be very difficult to work with:

    {
      "type": "element",
      "name": "blink",
      "attributes": {
        "foo": "bar"
      },
      "children": [
        {
          "type": "text",
          "content": "example text"
        }
      ]
    }


Once you've peeked at the complexity of some of the XML parsers (like Xerces, oh god Xerces) you'll undoubtedly want to avoid it like the plague. XML can get crazy-bananas very quickly. I fundamentally don't understand XML (just like I don't understand ASN.1) for anything beyond historical purposes.


The Atom spec is really easy to grasp. Your platform may even include a way to deal with it (.NET for example: https://msdn.microsoft.com/en-us/library/system.servicemodel...)


There are definitely complexities in the XML ecosystem, like XLink, schemas, namespaces, etc. But in practice, not every application needs all that stuff, and when using the "common" parts of XML, I don't find it difficult to understand or work with. But that's just me.


Yep, we don't really need another syndication format that no reader is going to support, or support well, for years. The only thing I see missing in RFC 4287 is a per-entry cover image/thumbnail, which you can solve with an extension (which no one supports, and that's kind of the point) anyway.


JSON, given the same schema, will always be more efficient byte-for-byte than XML. In addition, JSON as a format is native to JavaScript, which itself is ubiquitous. That's not even mentioning raw readability/writability.

Basically, XML is to JSON as SOAP is to REST. It had its day, and it's obviously still useful, but we have better tools now. Frankly, I'm surprised we haven't seen a proposal like this sooner.


> XML is to JSON as SOAP is to REST

That's true. Both XML and SOAP are well defined, and well structured.

JSON and REST are both marginally defined, and thus we see constant incompatible/incomplete implementations, or weird hacks to overcome the shortcomings.

> we have better tools now

I think "the cool kids are cargo-culting something newer now" is probably more accurate.


Nitpick: REST is very well defined. It's not just a protocol, like some people insist.

Other than that, fully in agreement.


REST is effectively a concept, and it's up to developers to follow the rules it sets.

You can't take your codebase, add some glue code to a REST module, and know that it will be usable by any other REST consumer/client, because no one follows the guidelines exactly the same way.


Yeah this is great, now instead of properly machine-readable and verifiable XSD files we have pseudo-RFC text on some shitty GitHub page.


Part of me is also with you - JSON is indeed smaller than XML, but we do have gzip almost everywhere around the web, and with gzip they don't differ that much in size. Also, if people really cared about this, why wouldn't they use a binary format, something like protobuf?

And the other part of me is not with you - manipulating XML is not as easy as JSON for most of my development time, and sometimes I even need to write something by hand, for which JSON is much more handy. Tons of other formats are more human-friendly than JSON, for example TOML, but they don't have the status JSON has. So I guess JSON is kind of the choice under the current state of web development.


In practice, json is much easier to work with on the command line because of jq.


No, we don't. This doesn't do anything except break compatibility.


Yikes, you didn't even make it to the second sentence.

> JSON is simpler to read and write, and it’s less prone to bugs.


> JSON is simpler to read and write, and it’s less prone to bugs.

I don't actually find either of those things to be true.


I simply think you're lying to yourself. It's both literally and theoretically simpler to write and digest; let's start with the simplest case: {}. Prone to bugs is a matter of debate, depending on a number of factors.


That's a fine opinion to have, but that doesn't mean that people (the authors, or devs generally) use JSON out of vanity. As an aside, you're the first person I've heard suggest people put their identity into a serialization format, which gave me a good laugh.


From the HN guidelines:

Please don't insinuate that someone hasn't read an article.


My mistake for phrasing my point in a manner that violates HN guidelines. I tried to edit, but I missed the window. At any rate, my point stands.


We don't prefer JSON to XML for any reason other than that XML is terrible by comparison.


It's funny to me that at the same time people are flocking to languages with strong, flexible type systems (often with compile-time checks), we are fleeing from a strongly typed data interchange format in favor of a dynamic bag of objects and arrays.


I think that's because even if the data interchange format is strongly typed, as a consumer you often still must expect _anything_.

I've yet to work on a project that handles XML where we have an XSD prevalidation step that makes the reading of some deeply nested XML tag feel safe.

Unless we count XML <-> data object binding back in the java days. Not sure that felt any better...


On the flip side, I've only ever not had an XSD when I was building something myself and actively didn't care.

The truth, I tend to suspect, lies somewhere in between. =)


That's not a reason.


For anyone who's tried to write a real-world RSS feed reader, this format does little to solve the big problems the newsfeeds have:

* Badly formed XML? Check. There might be badly formed JSON, but I tend to think it'll be a lot less likely.

* Need to continually poll servers for updates? Miss. Without additions to enable pubsub, or dynamic queries, clients are forced to use HTTP headers to check last updates, then do a delta on the entire feed if there is new or updated content (a sketch of this polling dance follows at the end of this comment). Also, if you missed 10 updates, and the feed only contains the last 5 items, then you lose information. This is the nature of a document-centric feed meant to be served as a static file. But it's 2017 now, and it's incredibly rare that a feed isn't created dynamically. A new feed spec should incorporate that reality.

* Complete understanding of modern content types besides blog posts? Miss. The last time I went through a huge list of feeds for testing, I found there were over 50 commonly used namespaces and over 300 unique fields in use. RSS is used for everything from search results to Twitter posts to podcasts... It's hard to describe all the different forms of data it can contain. The reason for this is that the original RSS spec was so minimal (there are like 5 required fields) that everything else has just been bolted on. JSON Feed makes this same mistake.

* An understanding that separate but equal isn't equal. Miss. The thing that http://activitystrea.ms got right was the realization that copying content into a feed just ends up diluting the original content formatting, so instead it just contains metadata and points to the original source URL rather than trying to contain it. If JSON Feed wanted to really create a successor to RSS, it would spec out how to send formatting information along with the data. It's not impossible - look at what Google did with AMP: they specified a subset of formatting options so that each article can still contain a unique design, but limited the options to increase efficiency and limit bugs/chaos.

This stuff is just off the top of my head. If you're going to make a new feed format in 2017, I'm sorry but copying what came before it and throwing it into JSON just isn't enough.
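
For reference, the polling dance from the second point looks roughly like this (a sketch using Python's requests and a conditional GET; the feed URL is made up, and a real reader would persist the ETag and seen ids between runs):

    import requests

    feed_url = "https://example.org/feed.json"  # hypothetical feed
    etag = None                                 # persisted from the previous poll in a real reader
    seen_ids = set()                            # likewise persisted

    resp = requests.get(feed_url, headers={"If-None-Match": etag} if etag else {})

    if resp.status_code == 304:
        pass  # nothing changed since the last poll
    else:
        etag = resp.headers.get("ETag")
        for item in resp.json().get("items", []):
            if item["id"] not in seen_ids:  # delta against what we've already seen;
                seen_ids.add(item["id"])    # anything that rotated out of the window is lost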


FWIW, This is by Manton Reece and Brent Simmons. And Simmons is known (among other things) as the creator of NetNewsWire which has been around for more than 15 years. He does know a bit about Atom and RSS feeds.

https://en.wikipedia.org/wiki/NetNewsWire


Ok, I have no idea who these guys are, so forgive me for being rude: if they're so good, then why did they not address those points? To my eyes, the OP makes a solid argument. I'd like to know their side of the story.


But they did...

> Badly formed XML? Check. There might be badly formed JSON, but I tend to think it'll be a lot less likely.

The problem with XML is mostly that it is a very complex format so the bugs are more probable and there are more pitfalls.

> Need to continually poll servers for updates? Miss. Without additions to enable pubsub, or dynamic queries ...

They actually did add tags to enable WebSub (previously called PubSubHubbub), so there goes that. For the other concerns, I think it is not the format's job to care about partial or incomplete data. Nothing prevents you from having a dynamic link with an "updatesSince" parameter on your webpage and serving all of the articles that were added or updated after that. Nowhere does the format specify a limit on the number of items. It also incorporates paging out of the box, so you could bubble up any old articles.

> Complete understanding of modern content types besides blog posts? Miss.

The point of this is for the open web; by definition nobody can anticipate all formats. Rather than fill the spec with tweets, Facebook posts and other types, they have opted for the least common denominator and added a specific way to add extensions. This makes way more sense.

> * An understanding that separate but equal isn't equal. Miss.

Nothing actually prevents you from leaving the content fields blank and relying on the reader to pull the content. But for this kind of usage there are other methods. Personally I prefer content delivered in the RSS, precisely to avoid having to deal with customization of content formatting. JSON Feed HAS a way to specify formatting though, it's called HTML tags. No need to reinvent the wheel here.


I don't agree with most of what you wrote, but the "it's called HTML tags" part is the most wrong. You must not have tried this any time in the past 5 years or so. The embedded tags come out of CMSs and - when they're not stripped completely - look like <div class="title-main-sub-1"> and <span class="sub-article-v5-bld">. HTML isn't used alone, it's always used with CSS nowadays, and even if semantic tags are best practice, the fact is they're optional and regularly not used. If they're going to create a new standard format, they need to address this.


What is the difference between re-publishing the content in some other format which will handle formatting well, and re-publishing the content using sensible HTML tags with maybe some minimal embedded stylesheet?

There might be mis-use and abuse, but if you want to avoid that you can always push markdown into the "text" representation.


One has to wonder whether Simmons is just trying to revive the old RSS ecosystem. "What do developers like these days, JSON? Let's do RSS in JSON!" ... This does not help.

The real challenge these days is to replicate the solutions Facebook and Twitter brought to feeds (bidirectionality and data-retention in particular) in a decentralised manner that could actually become popular. Simply replicating RSS in the data-format du jour is not going to achieve that.


> Need to continually poll servers for updates? Miss. Without additions to enable pub sub, or dynamic queries, clients are forced to use HTTP headers to check last updates, then do a delta on the entire feed if there is new or updated content.

This is backwards, imo. The advantage of polling over pub sub is that all complexity is offloaded to the client. This comes with its own set of problems (inefficiency of reinventing the wheel across all clients, plus every client will implement that complexity differently resulting in countless bugs), but this is what drives adoption, which as someone else here has pointed out is all that matters. If you want adoption, you seemingly need to sacrifice a lot of efficiency in favour of making it stupidly easy to publish.

The "it's 2017 now" argument doesn't really address that even with dynamically generated content, you still need every dynamic serverside platform to adopt and implement your spec independently. Static is always easier. (plus with the recent trend towards static sites, "it's 2017 now" actually has the opposite implication).


Plus, you can always reuse PubSubHubBub (now WebSub[1]), which is already used in RSS/Atom feeds to provide optional subscribing to updates if both the server and client support it.

[1] https://www.w3.org/TR/websub/


> The thing that http://activitystrea.ms got right was the realization that copying content into a feed just ends up diluting the original content formatting, so instead it just contains metadata and points to the original source URL rather than trying to contain it.

It's a shame that ActivityStrea.ms hasn't had more uptake. We've added support in our enterprise social network product and think it enables some cool scenarios. But unfortunately too few other products support it these days.


> Need to continually poll servers for updates? Miss.

The point of these syndication formats (RSS, Atom, now this) was always to act as the "I'm a static site and webhooks don't exist, so poll me" equivalent of webhooks. These "pretending to be webhooks" feeds were supposed to hook into a whole ecosystem of syndication middleware that turned the feeds into things like emails.

And that—the output-products of the middleware—was what people were supposed to consume, and what sites were meant to offer people to consume. The feed, as originally designed, was not intended for client consumption. That's why the whole model we have today, where millions of "feed-reader" clients poll these little websites that could never stand up to that load, seems so silly: it wasn't supposed to be the model. RSS feeds were supposed to be a way for static-ish content to "talk to" servers that would do the syndicating for them; not a format for clients to receive notifications in.

(And we already had a format for clients to receive notifications in: MIME email. There's no reason you can't add another MIME format beyond text/plain and text/html; and there's no reason you can't create an IMAP "feed-reader" that just filters your inbox to display only the messages containing application/rss+xml representations, and set up your regular inbox to filter out those same messages. And some messages would contain both representations, so you'd see e.g. newsletters as both text in your email client and as links in your feed client, and archiving them in one would do the same in the other, since they're the same message.)

---

The big problem I have with feeds (besides that people are using them wrong, as above) is that they have no "control-channel events" to notify a feed-consumer of something like e.g. the feed relocating to a new URL.

Right now, many feeds I follow just die, never adding a new feed item, and the reason for that is that, unbeknownst to me, the final item in the feed (that I never saw because it rotted away after 30 days, or because I "declared inbox zero" on my feeds, or whatever else) was a text post by the feed's author telling everyone to follow some new feed instead.

And other authors don't even bother with that; they use a blogging framework that generates RSS, but they're maybe not even aware that it does that for them, so instead they tell e.g. their Twitter followers, or Twitch subscribers, that they're moving to a new website, but their old website just sits there untouched forever-after, never receiving an update to point to the new site which would end up in the RSS feed my reader is subscribed to. It's nonsensical.

(And don't get me started on the fact that if you follow a Tumblr blog's RSS feed, and the blog author decides to rename their account, that not only kills the feed, but also causes all the permalinks to become invalid, rather than making them redirect... Tumblr isn't alone in this behavior, but Tumblr authors really like renaming their accounts, so you notice it a lot.)


HTTP 301 Moved Permanently is the out of band control channel. Sometimes it even seems to work, depending on software of course.

There was also a typical Dave-Wineresque invention of replacing the old feed with some special, non-namespaced XML with the redirect: http://www.rssboard.org/redirect-rss-feed

But of course the real problem is social. As in, people simply stop blogging or stop caring. And of course tool developers don't care if someone doesn't want to use their software anymore, and don't think of developing the right buttons for this edge case.
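
A minimal sketch of honouring that control channel with Python's requests (the feed URL is made up; the point is that a permanent redirect should rewrite the stored subscription, not just be followed silently):

    from urllib.parse import urljoin
    import requests

    feed_url = "https://example.org/old-feed.json"  # hypothetical stored subscription

    resp = requests.get(feed_url, allow_redirects=False)
    if resp.status_code in (301, 308):
        feed_url = urljoin(feed_url, resp.headers["Location"])  # update the subscription
        resp = requests.get(feed_url)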


> HTTP 301 Moved Permanently is the out of band control channel.

True, but requires you to be able to set response codes on the server. I can't make my Github Pages site, or my Tumblr blog, or my S3 bucket, emit a 301. And those are the sorts of things that RSS was designed for: static sites that can't just, say, tell their backend to email people on update. You'd think that, knowing that, RSS et al would have been designed with in-band, rather than out-of-band, control.



Dave Winer (the creator of RSS) played with this a bit in 2012. It turns out that the exact format of feeds doesn't matter nearly as much as there being a more-or-less universal one.

http://scripting.com/stories/2012/09/10/rssInJsonForReal.htm...


I'm sure there's...

oh of course: https://xkcd.com/927/

(and I realize this doesn't exactly map, as JSON Feed isn't even trying to cover all the usecases of Atom or RSS, just switching the container format)


If we're going to talk about replacing XML with better data formats, why not switch to S-expressions?

    (feed
     (version https://jsonfeed.org/version/1)
     (title "My Example Feed")
     (home-page-url https://example.org)
     (feed-url https://example.org/feed.json)
     (items
      (item (id 2)
            (content-text "This is a second item.")
            (url https://example.org/second-item))
      (item (id 1)
            (content-html "<p>Hello, world!</p>")
            (url https://example.org/initial-post))))
This looks much nicer IMHO than their first example:

    {
        "version": "https://jsonfeed.org/version/1",
        "title": "My Example Feed",
        "home_page_url": "https://example.org/",
        "feed_url": "https://example.org/feed.json",
        "items": [
            {
                "id": "2",
                "content_text": "This is a second item.",
                "url": "https://example.org/second-item"
            },
            {
                "id": "1",
                "content_html": "<p>Hello, world!</p>",
                "url": "https://example.org/initial-post"
            }
        ]
    }


It looks nicer if you happen to like s-expressions. But to me, it's just replacing one flavor of clutter for another. The best reason not to prefer s-expressions to JSON, though, would be simply that one is already natively supported in browsers and the other would need a parser written in a language that already parses JSON.


There's EDN, which is to Clojure what JSON is to JS: a format close to the language's way of representing data.

https://github.com/edn-format/edn

Example:

https://github.com/milikicn/activity-stream-example/blob/4db...

Not S-expression-based, though.


JSON was influenced by Rebol, which I feel would provide an even nicer example:

    version: https://jsonfeed.org/version/1
    title: "My Example Feed"
    home-page-url: https://example.org
    feed-url: https://example.org/feed.json
    items: [
        [
            id: 2
            content-text: "This is a second item."
            url: https://example.org/second-item
        ]
        [
            id: 1
            content-html: "<p>Hello, world!</p>"
            url: https://example.org/initial-post
        ]
    ]


Those two aren't comparable because you cannot distinguish between key:val pairs and lists. You need dotted lists.


Nope, one would normally write the code which parses such S-expressions such that the first atom in each list indicates the function to use to parse the rest of the list. So there'd be a FEED-FEED function which knows that a feed may have version, homepage URL &c., and there'd be a FEED-ITEMS function which expects the rest of its list to be items, a FEED-ITEMS-ITEM function which knows about the permissible components of an item &c.

If you really want to do a hash table, you could represent it as an alist:

    (things
      (key1 val1)
      (key2 val2))
This all works because — whether using JSON, S-expressions or XML — ultimately you need something which can make sense of the parsed data structure. Even using JSON, nothing prevents a client submitting a feed with, say, a homepage URL of {"this": "was a mistake"}; just parsing it as JSON is insufficient to determine if it's valid. Likewise, an S-expression parser can render the example, but it still needs to be validated. One nice advantage of the S-expression example is that there's an obvious place to put all the validation, and an obvious way to turn the parsed S-expression into a valid object.


In the absolute abstract, you are correct. In the absolute abstract, you could replace parens with significant whitespace and have zero visible syntax.

In practice, Lisp adopted dotted lists 60 years ago, and basically every Lisp since has used them as one way to represent an association list. Minimal syntax is better than zero syntax or loads of syntax.


There is of course sxml, which is more or less well-defined. But that would probably suffer from the same problems as xml, since it is just xml written as s-expressions.

There is one pretty damn solid SSAX parser by Kiselyov that has been ported to just about every Scheme out there. It is interesting since it doesn't do the whole callback thing of most SAX parsers, but is implemented as a tree fold over the XML structure.


Beauty is in the eye of the beholder - I personally prefer the JSON.


It is worth pointing out that there is a relevant W3C Recommendation, "JSON Activity Streams": https://www.w3.org/TR/activitystreams-core/ . I'm not saying JSON Feed is worse, or better. I am saying that I think JSON Feed's adoption requires a detailed comparison between JSON Feed and JSON Activity Streams 2.0.


Thanks +1, didn't know that.


But does it solve any actual problems other than 'XML is not cool', problems big enough to deserve a new format?

It's true that JSON is easier to deal with than XML. But that's relative: there are plenty of decent tools around RSS, from readers, to libraries in the most common programming languages, to extensions in the most common content management systems. JSON is slightly easier for humans to read (although that's subjective), but then how often do you need to read an RSS feed manually, unless you're the one writing those libraries? That's a tiny share of all the people using RSS.

>>> It reflects the lessons learned from our years of work reading and publishing feeds.

Sounds like the author(s) have extensive experience in this field and know things better than some random person on the internet (me). But the project's homepage doesn't convey those learned lessons.


Yes, JSON is much easier to parse than XML, and it's preferred where it fits, such as for most Web API requests and responses.

However, SGML and XML were invented as structured markup languages for humans authoring rich text documents, for which JSON is unsuited and sucks just as much as XML sucks for APIs.

Edit: though XML has its place in many b2b and business-to-government data exchanges (financial and tax reporting, medical data exchange, and many others) where a robust and capable up-front data format specification for complex data is required


A few thoughts on the spec itself:

* In all cases (feed and items), the author field should be an array to allow for feeds with more than one author (for instance, a podcast might want to use this field for each of its hosts, or possibly even guests).

* external_url should probably be an array, too, in case you want to refer to multiple external resources about a specific topic, or in the case of a linkblog or podcast that discusses multiple topics, it could link to each subtopic.

* It might be nice if an item's ID could be enforced to a specific format, even if perhaps only within a single feed. Otherwise it's hard to know how to interpret posts with IDs like "potato", 1, null, "http://cheez.burger/arghlebarghle"


> a podcast might want to use this field for each of its hosts, or possibly even guests

I'm going to pretend this is about music artists in a music library, but the logic is exactly the same for podcast hosts:

You tend to want fields like this to be singular, so that the field can be used in collation (i.e. "sort by artist.")

If you have multiple artists for a track, usually one can be designated the "primary" artist—the one that people best know, and would expect to find the track listed under when looking through their library. Usually, then, the rest get tacked on in the field in a freeform, maybe comma-and-space delimited fashion. The field isn't a strict strongly-typed references(Person) field, after all; it's just freeform text describing the authorship.

But as for hosts vs. guests, that's a whole can of worms. Look at the ID3 standard. Even though music library-management programs usually just surface an "Artist" field, you've actually got all of these separate (optional) fields embedded in each track:

• TCOM: Composer

• TEXT: Lyricist/Text writer

• TPE1: Lead performer(s)/Soloist(s)

• WOAR: Official artist/performer webpage

• TPE2: Band/orchestra/accompaniment

• TPE3: Conductor/performer refinement

• TPE4: Interpreted, remixed, or otherwise modified by

• TENC: Encoded by

• WOAS: Official audio source webpage

• TCOP: Copyright message

• WPUB: Publishers official webpage

• TRSN: Internet radio station name

• TRSO: Internet radio station owner

• WORS: Official internet radio station homepage

That gives you separate credits for pretty much the entire composition, production and distribution flow, which usually means that each field only needs one entry.

Would be great if people used them, wouldn't it? Maybe the semi-standard "A feat. B (C remix)" microformat could be parsed into "[TPE2] feat. [TPE1] ([TPE4])"...
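
Just to make the idea concrete, a toy sketch of that parse in Python, assuming artist strings follow that exact "A feat. B (C remix)" shape (real-world strings are of course much messier):

    import re

    # Toy parse of the "A feat. B (C remix)" shape into ID3-ish roles.
    PATTERN = re.compile(r"^(?P<tpe2>.+?) feat\. (?P<tpe1>.+?) \((?P<tpe4>.+?) remix\)$",
                         re.IGNORECASE)

    def split_credits(artist):
        m = PATTERN.match(artist)
        if not m:
            return {"TPE1": artist}  # no match: treat the whole string as the performer
        return {"TPE2": m.group("tpe2"), "TPE1": m.group("tpe1"), "TPE4": m.group("tpe4")}

    print(split_credits("Massive Band feat. Guest Singer (Some DJ remix)"))
    # {'TPE2': 'Massive Band', 'TPE1': 'Guest Singer', 'TPE4': 'Some DJ'}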


I was thinking that all the urls and images should have been in arrays.


Probably, but I think the goal there is to have something that you can display on a summary page with a list of items or episodes, where there's just an icon for each (and a banner image for the background or header or some such), for which purpose I think a single image is fine (I totally get your wanting more than one, though, and I'm happy to be wrong here).


I would suggest specifying titles as html, not plain text. I've seen too many things titled "I <i>love</i> science!" over the years to believe in the idea that titles are plain text.

Also, despite the fact this is technically not the responsibility of the spec itself, I would strongly suggest some words on the implications of the fact that the HTML fields are indeed HTML and the wisdom of passing them through some sort of HTML filter before displaying them.

In fact that's also part of why I suggest going ahead and letting titles contain HTML. All HTML is going to need to be filtered anyhow, and it's OK for clients to filter titles to a smaller valid tag list, or even filter out all tags. Suggesting (but not mandating) a very basic list of tags for that field might be a good compromise.
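
For the strip-everything end of that spectrum, here's a minimal sketch using only Python's standard library; a real client would more likely run titles through a whitelist-based sanitizer:

    # Strip all tags from a title, keeping only the text content.
    from html.parser import HTMLParser

    class TextOnly(HTMLParser):
        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def strip_tags(title_html):
        p = TextOnly()
        p.feed(title_html)
        return "".join(p.chunks)

    print(strip_tags('I <i>love</i> science!'))  # -> I love science!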


Allowing HTML means the other side will have to sanitize that HTML (to avoid XSS). Using text means you can stick it in the DOM via innerText and be much more confident that you aren't injecting XSS.

I agree that I see HTML in RSS titles, but I'd rather have the occasional garbled title (which the author can fix by stripping HTML before it reaches the feed) than rely on every RSS reader to avoid opening up new security holes.


There is no way to avoid having to handle HTML safely. There's no point in trying to limit your exposure to that problem when the entire point of this standard is to ship around arbitrary HTML for interfaces to display. Once you've solved the hard problem of displaying the body safely, displaying the title is trivial. Making the title pure text does nothing useful. JSONFeed display mechanisms that are going to get this wrong are going to do things like leave injections in the date fields anyhow.


Following the separation of content_text and content_html attributes, it would make sense to have title_html and title_text attributes.


> It's at version 1, which may be the only version ever needed.

Wow. Now that's confidence. Have you ever read the first version of a spec and thought, "That's just perfect. Any additional changes would just be a disappointment compared with the original"?


MIDI 1.0 is maybe not perfect, but it has remained unchanged since 1983. People have tried to replace it for two decades, but have failed to provide any enhancements worth a switch.

But MIDI doesn't really fit that description, since it builds on two years of prior work by Roland. It's my best bet, though.


In all fairness, they're taking a more or less solved problem (feeds), so they don't really have to figure things out there, and they're porting this established solution to a very well-established technology (JSON), so also don't really have to figure stuff out in that sense either.

As far as scenarios where it's feasible to get the answer right the first time go, this is a reasonably realistic one.

EDIT: Also, if you scroll to the bottom of the page you can see they have let a whole bunch of people look at the spec before releasing it, so there has been at least some peer review.


Unsurprising as this is clearly an ego play, given that the first thing they want you to know is their names.


"Now it belongs to the ages!"


> JSON is simpler to read and write, and it’s less prone to bugs.

Less prone to bugs? How's that?


Consider XML entity bombs. You need to explicitly tell your XML parser not to follow the spec to prevent malicious sources of XML from crashing your application. XML also has a lot of room for syntax errors, with many types of tokens and escape rules. JSON, by comparison, does not.


> XML also has a lot of room for syntax errors, with many types of tokens and escape rules. JSON, by comparison, does not.

Parsing JSON is a minefield.

Yellow and light blue boxes highlight the worst situations for applications using the specified parser. Take a look at how a bunch of parsers perform with various payloads: http://seriot.ch/json/pruned_results.png

"JSON is the de facto standard when it comes to (un)serialising and exchanging data in web and mobile programming. But how well do you really know JSON? We'll read the specifications and write test cases together. We'll test common JSON libraries against our test cases. I'll show that JSON is not the easy, idealised format as many do believe. Indeed, I did not find two libraries that exhibit the very same behaviour. Moreover, I found that edge cases and maliciously crafted payloads can cause bugs, crashes and denial of services, mainly because JSON libraries rely on specifications that have evolved over time and that left many details loosely specified or not specified at all."

More details available at: http://seriot.ch/parsing_json.php
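
One concrete flavour of that looseness: the specs only say object member names "should" be unique, so parsers silently disagree about duplicates. A quick check with Python's json module (CPython keeps the last value; other libraries keep the first, or reject the document):

    import json

    # Duplicate member names are only a "should not", so behaviour varies.
    doc = '{"id": 1, "id": 2}'
    print(json.loads(doc))  # CPython: {'id': 2} -- the last value silently wins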


None of these issues are as bad as the XML ones. You generally don't need a "defusedjson" the way you need https://pypi.python.org/pypi/defusedxml to defend against payloads like:

    <!DOCTYPE external [ <!ENTITY ee SYSTEM "file:///etc/ssh/ssh_host_ed25519_key"> ]> <root>&ee;</root>
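
For reference, a rough sketch of what using defusedxml looks like on the Python side; it refuses documents that declare entities, so a payload like the one above is rejected before any expansion happens:

    # Hedged sketch: defusedxml (third-party, pip install defusedxml) rejects
    # documents with entity declarations instead of trusting parser defaults.
    import defusedxml.ElementTree as DET

    payload = ('<!DOCTYPE external ['
               '<!ENTITY ee SYSTEM "file:///etc/ssh/ssh_host_ed25519_key">]>'
               '<root>&ee;</root>')

    try:
        DET.fromstring(payload)
    except Exception as exc:  # defusedxml raises EntitiesForbidden here
        print("rejected:", exc)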


Parser correctness is irrelevant when you're talking about how easy it is to write the format without syntax errors. For instance, JSON has one type of string with one set of string escape rules. XML has element names, attribute names, attribute values, text nodes, CDATA content, RCDATA content, and more. And almost all of them have different rules for what they can contain and how they can be used.

By comparison, XML is orders of magnitude more complex than JSON.


> XML also has a lot of room for syntax errors,

No it doesn't. XML is either well formed or not, and any parser encountering non well-formed XML will reject it outright.

Therefore all XML in use on the internet is spec-compliant.

Now try to say the same about JSON.


> any parser encountering non well-formed XML will reject it outright.

Ah, I see you're new to parsing XML.


Oh, it will be rejected alright. And then you're forced to override the parser, or to manipulate the XML before parsing it because it makes business sense to not have the source fix their XML for some reason.

People and machines are just utterly incapable of outputting valid XML.


JSON parsers have a much smaller 'feature surface' meaning that there are fewer nooks and crannies for bugs to live in.

One example of a bug that often festered in XML parsers: https://en.wikipedia.org/wiki/Billion_laughs (there is no JSON equivalent of this)

The generalized theory, for those interested: https://en.wikipedia.org/wiki/Rule_of_least_power


While I'd agree that parsing JSON is much easier than XML, it is still not completely trivial as demonstrated by this article: http://seriot.ch/parsing_json.php


From what I grok, the guy requesting JSON-LD wants this functionality.


Probably this part:

> simpler to read and write


If you're writing these things by hand, you're probably doing something wrong...


Deserializing somebody else's XML into usable internal data structures generally requires writing serialization/deserialization code by hand, and it is always a pain in the ass. On the other hand, JSON's basic structures map to reasonable internal representations, so I can often simply iterate through the structures coming as-is from the parser library.

I mean, if the same webservice is offering the same data in both XML and JSON format, chances are I'd have to write less code for handling the JSON endpoint. For a client written in e.g. Java both cases may be pretty much equal, but for dynamic languages like Javascript or Python, the difference is significant.
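
A minimal illustration of that point, assuming a file shaped like the JSON Feed example earlier in the thread:

    import json

    # JSON arrives as plain dicts and lists, so no hand-written mapping layer.
    with open("feed.json") as f:      # hypothetical local copy of a feed
        feed = json.load(f)

    for item in feed.get("items", []):
        print(item["id"], item.get("url"))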


This is a straw man, IMO. Obviously, in production, the actual JSONs will interact very little with humans. But there's still development, debugging, etc.

So you will need to write small cases during development, tweak existing cases, etc.

Also, many tools accept configuration in JSON, which is somewhat convenient to write by hand, and is easily machine readable. Sublime Text comes to mind, for example.


JSON is also easier for computers to read and write...


XML generators and parsers have been in use for a decade+. Pretty sure most of the bugs have been found and fixed by now.

It's just reinventing the wheel because the new generation don't want to use the same tools the previous generation did. The time and effort spent doing this is quite ridiculous.

(FWIW, I hate XML, and JSON is far better. But there are more important things to work on.)


> Pretty sure most of the bugs have been found and fixed by now.

Given the complexity and what I've seen from some other long established codebases, I don't share your confidence.

> It's just reinventing the wheel because the new generation don't want to use the same tools the previous generation did.

You can disagree with the decisions involved (as you did with the XML vulnerability argument), but the fact that those arguments exist means they AREN'T doing it just because they don't want to use the same tools the previous generation did - they have different reasons that you think aren't good reasons.

Saying it as you did comes across as smug and dismissive, which is not an effective way of convincing your audience that you've taken arguments into account when making your decision.


RSS is sometimes ambiguous and there's a lot of variation. It can be hard to parse correctly. Not sure about Atom, though.


> RSS is sometimes ambiguous and there's a lot of variation.

I've written a reasonably-popular podcast feed validator, and I don't understand either of these criticisms. Mind elaborating?


Not the parent, but my company consumed a bit of RSS starting in 2005, with the volume declining to zero over the years.

Over time we've been fed feeds with character encodings matching neither what the web server declared nor what the XML declared. Use of undeclared XML namespaces, or, quite popular, using elements from other namespaces without namespaces or declarations -- just shove some nice iTunes things or Atom things into the RSS. Also invalid XML -- just skipping the closing tags was popular.

These feeds were from paying customers, and we were not the primary consumers - so when we complained they would generally point to someone else who was consuming the feed without problems. Sometimes we'd point them at a validator, if they were a small enough customer -- but mostly we just kept working on our in-house RSS feed reader that could read tag soup.

Things did massively improve over time, and by the end we were getting _mainly_ reasonably valid RSS.


I haven't written XML parsers myself, but I remember Nick Bradbury, of FeedDemon fame, wrote about it a lot back in the day:

* https://nickbradbury.com/2006/09/21/fixing_funky_fe_1/

* http://nick.typepad.com/blog/2004/01/feeddemon_and_w.html

* https://en.wikipedia.org/wiki/FeedDemon


Since you've done it recently, I'm sure you know more than I do; I suspect my knowledge of it is obsolete.


> I've written a reasonably-popular podcast feed validator

Mind sharing?



Nice, very cool! Definitely an improvement on the trash legacy validators out there.


I couldn't help but take a dismissive stance toward the rest of the page after reading the first paragraph.


Shortly after RSS 0.9 came out, RSS 1.0 reformulated the RSS vocabulary in RDF terms. Of course, the modern (sane) successor to RDF/XML is JSON-LD.

So I'm hoping for JSON-LD Feed 1.1 and a new war of format battles. Maybe we can even get Mark Pilgrim out of hiding!


Someone should open a social network for feed-wars veterans.

More seriously, it's sad to see that almost 20 years later, the dream of a decentralised and bidirectional web is in even worse shape than it was back then.


If you create a new JSON-based document format, please consider using JSON-LD (alongside the raw JSON data) so we can make a true world of interconnected data through semantic formats. At the very least, it would let me generate code and automatically validate format compatibility from a well-defined schema. Thank you!

EDIT: Because I got downvoted despite just stating my opinion on the topic, I have adjusted the statement.


No, please don't.


Well now I don't know what to think.


Why not?


Is this a "JSON Feed" from NYTimes?

The example below extracts all article URLs for a specific section of the paper.

    # require exactly one argument: the section name
    test $# = 1 || exec echo usage: $0 section

    # fetch the section front, then print only the "guid" URLs
    curl -o 1.json https://static01.nyt.com/services/json/sectionfronts/$1/index.jsonp
    exec sed '/\"guid\" :/!d;s/\",//;s/.*\"//' 1.json
I guess SpiderBytes could be used for older articles?

Personally, I think a protocol like netstrings/bencode is better than JSON because it better respects the memory resources of the user's computer.

Every proposed protocol will have tradeoffs.

To me, RAM is sacred. I can "parse" netstrings in one pass, but I have been unable to do this with a state machine for JSON; I have to arbitrarily limit the number of states or risk a crash. As easily as a user's available RAM can be exhausted with JavaScript, it can be exhausted with JSON. Indeed, the two go well together.
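
For the curious, a rough sketch of that one-pass idea for netstrings ("length:payload,") in Python, with an explicit and entirely arbitrary size cap so an absurd length prefix can be rejected before anything large is allocated:

    MAX_LEN = 1 << 20  # arbitrary 1 MiB cap; reject before allocating anything big

    def read_netstring(stream):
        # Netstring format: b"12:hello world!," -- length, colon, payload, comma.
        digits = b""
        while True:
            c = stream.read(1)
            if c == b":":
                break
            if not c.isdigit() or len(digits) > 10:
                raise ValueError("bad length prefix")
            digits += c
        length = int(digits)
        if length > MAX_LEN:
            raise ValueError("netstring too large")
        payload = stream.read(length)
        if len(payload) != length or stream.read(1) != b",":
            raise ValueError("truncated netstring")
        return payload

    import io
    print(read_netstring(io.BytesIO(b"12:hello world!,")))  # b'hello world!'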


"JSON has become the developers’ choice for APIs", I'm curious about how people feel about this statement from a creation vs consumption perspective.

I'm currently creating an API where I'm asking devs to post JSON rather than a bunch of separate parameters, but I haven't seen this done in other APIs (if you have, can you point me to a few examples?). I'm curious what others' thoughts are on this. It seems that with GraphQL, we're maybe starting to move in this direction.
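
For what it's worth, posting a JSON body is pretty routine now; with Python's requests library the call looks roughly like this (the endpoint and fields here are made up):

    import requests  # third-party: pip install requests

    # Hypothetical endpoint and payload, just to show the shape of the call.
    resp = requests.post(
        "https://api.example.org/v1/feeds",
        json={"title": "My Example Feed", "home_page_url": "https://example.org/"},
        timeout=10,
    )
    resp.raise_for_status()
    print(resp.json())

GitHub's REST API is one widely used example that accepts JSON request bodies.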


I'd like to see a language field available at the item level. You can derive the language from the HTTP headers, but if you're dealing with linkblogs it would be nice to have it at the item level to help with filtering.

I think that images and URLs would do well as ordered lists rather than as individual values. At the top level you have three URLs and an array for hubs; with a type and a url, you could have a single array covering both the hubs and the URLs. The same could be done for images at the top level, and for both again at the item level.


It's unfortunate that XML has fallen so out of favor that well-made, strongly-schemad formats specified in XML, like Atom, are suffering in turn -- although the reasons for feeds' demise go well beyond their on-the-wire formats. This trend frustrates me, but it's undeniable that a lot of web data interchange happens with JSON-based formats nowadays, and the benefits of network effects, familiarity, and tooling support make JSONification worth exploring.

But even more frustrating is when a format comes out that's close to being a faithful translation of an established format, but makes small, incompatible changes that push the burden of faithful translation onto content authors, or the makers of third-party libraries.

I honestly don't intend to offer harsh targeted critique against the authors -- I assume good faith; more just voicing exasperation. There have been similar attempts over the years -- one from Dave Winer, the creator of RSS 0.92 and RSS 2.0, called RSS.js [1], which stoked some interest at first [2]; others by devs working in isolation without seeming access to a search engine and completely unaware of prior art; some who are just trying something unrelated and accidentally produce something usable [3]; finally, this question pops up from time to time on forums where people with an interest in this subject tend to congregate [4]. Meanwhile, real standards-bodies are off doing stuff that reframes the problem entirely [5] -- which seems out-of-touch at first, but I'd argue provides a better approach than a similar-but-not-entirely-compatible riff on something really old.

And as a meta, "people who use JSON-based formats", as a loose aggregate, have a serious and latent disagreement about whether data should have a schema or even a formal spec. In the beginning when people first started using JSON instead of XML, it was done in a schemaless way, and making sense of it was strictly best-effort on part of the receiving party. Then a movement appeared to bring schemas to JSON, which went against the original reason for using JSON in the first place, and now we're stuck with the two camps playing in the same sandbox whose views, use-cases, and goals are contradictory. This appears to be a "classic" loose JSON format, not a strictly-schemad JSON format, not even bothering to declare its own mediatype. This invites criticism from the other camp, yet the authors are clearly not playing in that arena. What's the long-term solution here?

[1] http://scripting.com/stories/2012/09/10/rssInJsonForReal.htm... [2] https://core.trac.wordpress.org/ticket/25639 [3] http://www.giantflyingsaucer.com/blog/?p=3521 [4] https://groups.google.com/forum/#!topic/restful-json/gkaZl3A... [5] https://www.w3.org/TR/activitystreams-core/


Why is it size_in_bytes and duration_in_seconds, as opposed to the pattern set by content_text and content_html?

It should just be size and duration, or size_bytes and duration_seconds (though adding units only makes sense if other units could be used). Adding _in to the mix is strange.


A good announcement explains what problem it is intending to solve.


This seems like a great idea. If it can help even one developer it's worth it.


How would it help even one developer?

Or asked another way, what problem does this solve for you?


So, my personal blog doesn't get a ton of traffic, but the one article that gets the most traffic is an article about how to monkeypatch feedparser to not strip out embedded videos.

While not hard evidence, I think it's indicative of the kind of experience a developer has when they choose to engage with syndication.


Doesn't Wordpress already have something like this? http://v2.wp-api.org/

I don't understand why suddenly people treat this like something that uniquely solves a problem. Maybe I'm missing something?


This format is more akin to RSS than to a programmatic REST API. The main goal is to be able to avoid the pitfalls of parsing Atom and RSS feeds. Both Brent Simmons and Manton Reece are quite active in making decentralized alternatives for self-publishing, for which RSS is the current backbone.


Parsing RSS and Atom feeds is a solved problem, no?


JSON Feed is a new solution for the problem already solved by RSS or Atom. It makes it easier to develop new publishers and consumers. It also tackles the main problems with these two formats, e.g.: no realtime subscriptions, mandatory titles which are a pain for microblogs, potential security problems with XML and so on.

As somebody somewhere has written: if no one had ever reinvented the real wheel, our cars would still be rolling around on big wooden logs.


XML is awful, but it does have CDATA, which lets you embed blog posts directly, and it's easy to debug.

String-encoded blog posts are going to be painful once people start using the `content_html` part of the spec.


Naw, JSON has reasonable quoting in its strings. It's maybe painful to read the raw JSON, but it encodes just fine.
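
A quick illustration with Python's json module:

    import json

    post = '<p>She said "hello" &amp; waved.\nNew paragraph.</p>'
    encoded = json.dumps({"content_html": post})
    print(encoded)
    # {"content_html": "<p>She said \"hello\" &amp; waved.\nNew paragraph.</p>"}

    assert json.loads(encoded)["content_html"] == post  # round-trips exactly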


I'm surprised no one has started a snake_case vs. camelCase debate here! https://jsonfeed.org/version/1


Good lord, Web people, stop it. You are embarrassing yourselves. We already have standards and you need to stop recreating everything in JavaScript.


Brent Simmons is hardly some webdev kid barging in. He was the original developer of NetNewsWire, a very popular and influential feed reader application which is now 15(!) years old.


Thanks, I knew I recognized the name.


Who uses feeds? Who uses XML?


stopped reading after "JSON is simpler to read and write, and it’s less prone to bugs." ....


I have grave concerns that this publishing format is delivered to us by two people that, as far as I can see, have limited to zero publishing background.

That said, they're being responsive to questions in Issues, so I remain optimistic.


Learning about the history of RSS should alleviate your concerns. Brent Simmons has been working in the space for 15 years, writing one of the more popular clients and working at a company which provided sync and syndication services:

https://en.wikipedia.org/wiki/NetNewsWire#History



