Why JSON will continue to push XML out of the picture (appfog.com)
29 points by lucperkins 1888 days ago | 45 comments

I think this article has done a great job enumerating trends that show JSON is beating XML for data serialization applications. I think these points are evidence of a shift in thinking, but not the reason for the shift itself.

Why JSON over XML? Because people need a data serialization format and XML is a Markup Language. JSON is gaining widespread adoption for data serialization applications since it's the correct tool. XML isn't.

In a markup language there is an underlying text that you're annotating with machine readable tags. Most data interchange doesn't have an underlying text -- you can't strip away the tags and expect it to be understandable. If you're writing a web page, that has to be read by humans and interpreted by machines... you need a markup language.

By contrast, data interchange is about moving arbitrary data structures between processes and/or languages. JSON's information model fits this model perfectly: its nested map/list/scalar is simple & powerful. As for typing, it found a sweet spot with text/numeric/boolean.

JSON is the right tool for the data serialization problem.

This makes sense, but from what I can tell, in virtually no major XML-based systems is the basis for XML files an underlying text extended with markup. Most XML systems, since the dawn of XML, have been top-to-bottom structured data.

You're correct that almost no XML formats, with the exception of XHTML, have an underlying text. This is the core problem, since the underlying information model inherited from SGML presumes one. XML brings with it significant overhead for dealing with textual data, and most data isn't textual. Why is there an element vs attribute debate? Wrong information model -- it's a distinction without a difference when you're doing data serialization. This is why XML was the shoe that never quite fit and why JSON will displace it so easily.

In 98/99, the XML bandwagon was something no one wanted to miss, it was the Web 2.0 and everyone knew it was the future. It was Java/WORA ("Write Once Run Anywhere") for data interchange and promised that you wouldn't be locked into a proprietary application. The marketing hype was simply outstanding. Even for technical people that hated XML itself, the promise of open formats was something you couldn't ignore and had to support even if you had to hold your nose. Open formats have since won -- holding your nose isn't needed anymore.

Now that the marketing hype of XML doesn't shut down the technical debate... JSON will soon dominate for data serialization tasks.

In the financial sector/mortgage/insurance, there are several major standards based on XML that are not top-down.

I may be talking out of my ass here, but isn't OO-XML and whatever Microsoft calls their new Office format exactly that?

You're right, I forgot about the "proprietary" document formats that got transformed into "open" (heh) XML formats. Good point. Thanks!

Both OO-XML and ODF are XML based. They both put the main XML document in a ZIP archive along with metadata and other resources.

With the almost unique exception of xhtml. But that's actually the only one I can think of.

ON vs ML, nice

But there's also the stack. XML has XSD for validation and documentation; XSLT and XQuery for transformation; and most people seem to like XPath. The overwhelming response to analogues for JSON is horror - don't pollute our simplicity! - and acknowledgement that while some tasks do indeed need these features, the XML stack already has them. The corruption of XML is what keeps JSON clean.

It also sounds like the right tool for making something like SVG. The vast majority of SVG data isn't text.

In SGML land, you wouldn't have implemented SVG using tags, you would have created a NOTATION with syntax specific to the problem at hand. In XML land, everything is XML; for example, schemas are XML (in SGML they are DTDs, that are _not_ SGML), transforms are in XML (in SGML, they are DSSSL, a lisp variant). The XML approach is one-size-fits-all, not the use-the-best-tool/syntax for the job.

For me the single greatest selling point of JSON is that it's just so danged easy to go from json string to a usable map/list/dictionary in every language. Most of the time you can get from A to B in one or two lines of code.
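
For instance, in Python the trip really is about a line each way. A minimal sketch using only the standard `json` module; the payload is made up for illustration:

```python
import json

# Hypothetical payload, invented for this example.
text = '{"name": "Bob", "age": 50, "tags": ["a", "b"]}'

data = json.loads(text)        # JSON string -> dict/list/str/int
age = data["age"]              # plain dictionary lookup
first_tag = data["tags"][0]    # plain list indexing

round_trip = json.dumps(data)  # and back to a string
```

No schema, parser selection, or tree walking is involved; the decoded value is already an ordinary dictionary.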

XML always seemed like such a struggle by comparison. Figuring out which parser(s) you've got installed, figuring out their respective APIs -- it felt like total overkill. The only way I could be productive with XML was using Python's ElementTree API because it was so simple.
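
For comparison, roughly the same lookup with ElementTree, which is about as simple as XML gets in Python. A sketch over a made-up document; note that everything comes back as text unless you convert it yourself:

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this example.
text = '<person name="Bob"><age>50</age><tag>a</tag><tag>b</tag></person>'

root = ET.fromstring(text)
name = root.get("name")                        # attribute lookup
age = int(root.findtext("age"))                # element text, converted by hand
tags = [t.text for t in root.findall("tag")]   # repeated child elements
```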

Some day I'll need my data to be checked against a complicated schema. But until that day arrives, I'm sticking with JSON.

XML is a smooth fit on strongly typed languages. You can easily translate an exact type into a corresponding XML encoding and know the type of what you're getting out on the other end. JSON on the other hand is duck typing in web service form. You can shove any data structure in on one end, and get it back out the other end, without writing any custom code, and without actually knowing the type of the data you've sent. You could say that JSON itself is weakly typed.

The popularity of JSON is tied to the popularity of weak typing. You can more rapidly iterate your API design and codebase without those bothersome types getting in the way. The flip side of that is the end result isn't "done done". It lacks full validation of input and it lacks complete documentation. In short it's more difficult to use and more prone to bugs and security issues. I suspect that if you compare "done done" APIs, JSON and SOAP are probably equally productive.

Having said that, I use JSON myself. It's too easy to get going in.

"XML is a smooth fit on strongly typed languages. You can easily translate an exact type into a corresponding XML encoding and know the type of what you're getting out on the other end."

This is a characteristic of the encoding and decoding layer, not the data format. Haskell's aeson library [1] is a JSON serialization library that is perfectly well strongly typed. And yes, that's strongly typed with your local domain datatypes and a relatively-easy-to-specify conversion back and forth, not merely strongly typed by virtue of having a "JSONString" type here and a "JSONNum" type there.

[1]: http://hackage.haskell.org/packages/archive/aeson/
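
The same idea can be sketched outside Haskell. This is not aeson, just an illustration in Python of a typed decoding layer sitting on top of plain JSON; the `Person` type and the `decode_person` helper are hypothetical names:

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str
    age: int


def decode_person(text: str) -> Person:
    """Decode JSON into a Person, rejecting anything of the wrong shape."""
    raw = json.loads(text)
    if not isinstance(raw, dict):
        raise ValueError("expected a JSON object")
    name, age = raw.get("name"), raw.get("age")
    if not isinstance(name, str):
        raise ValueError("name must be a string")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(age, int) or isinstance(age, bool):
        raise ValueError("age must be an integer")
    return Person(name=name, age=age)


person = decode_person('{"name": "Bob", "age": 50}')
```

The typing lives entirely in the decoder; the wire format is the same JSON a dynamically typed client would read.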

That's an impressively succinct way of mapping types to JSON, but it's still a mapping. There's one step for the developer between obtaining the JSON and using its data. In weakly typed languages there is no such step, the JSON data is the object you interact with in your business logic.

There's always a serialization step. The type of the resulting data is a consequence of the serialization technique, not the data format. I demonstrated the part you seemed to most strongly claim didn't exist, JSON <-> strong typing, but I can show you "weakly-typed" XML too. In addition to the DOM, which is a standardized "weak type" XML representation, you also have things like ElementTree http://effbot.org/zone/element-index.htm .

It is the case that JSON has a simple default weak serialization in many popular languages, and that it is a great deal of the reason for its popularity, but it is worth pointing out this is a local effect in the Javascript/Python/Perl/Ruby space, and that it hasn't got anything to do with strongly or weakly typed but rather what the target languages shipped with. There is no natural mapping for JSON in C++, C#, Erlang, Haskell, Prolog, SQL, or a wide variety of other languages (and Erlang and SQL are both fairly "weakly typed"), and even in JS/Python/Perl/Ruby there are some edge cases that can bite you if you aren't careful about exactly what the "just_decode_some_json_man()" is really doing with Unicode and numbers that may not fit into 32 bits.
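
The number edge case is easy to reproduce. A sketch in Python, using 2**53 + 1 rather than a 32-bit boundary but showing the same class of problem: the default decoder keeps big integers exact, while a decoder that reads every number as a double (simulated here with `parse_int=float`) silently changes the value:

```python
import json

text = '{"id": 9007199254740993}'   # 2**53 + 1

exact = json.loads(text)["id"]                   # Python int, arbitrary precision
lossy = json.loads(text, parse_int=float)["id"]  # forced through a 64-bit double

same = (exact == 2**53 + 1)
drifted = (int(lossy) == 2**53)     # off by one after the float conversion
```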

(Also, I scarequote all my "weakly typed" because the term is basically ill-defined. I'm coming around to prefer "sloppy type", which is a language where all values are perfectly well strongly typed but the language and/or library is shot through with automatic coercions and/or extensive duck typing. A sloppy type language considers it a feature that a function may have a value and not really know or care what it is.)

I think part of the reason "weakly typed" is ambiguous is because it's a bit pejorative, and "sloppy type" certainly isn't helping that. Maybe just "less typed" ? It really is an engineering tradeoff of how many assumptions you want to make explicit.

In another sense, XML is strongly typed, in that it has a schema language (XML Schema - and DTDs are part of the XML spec itself).

I don't understand why JSON schemas and the OJMs (Object-JSON-Mappers ;) it would enable aren't being developed more heavily.

I love JSON, but when working on APIs between large companies / departments the "we'll just send JSON like this and email you when we change stuff" really won't cut it.

Let me state that for the record, I believe JSON is a fantastic data interchange format, especially when compared with the current state of XML.

However, the point you've touched on is exactly my gripe with JSON. I just might not know enough, which is entirely possible, but afaik all the JSON schemas are either extremely complicated (I'm looking at you json-schema) or way too simple (jschema).

When working with service oriented architectures and if you're following the principles of RESTful architecture, discoverability and HATEOAS become central to your service. That means that the API needs to be self-documenting.

How does one do this with JSON? Essentially, if you boil the problem down, what someone would try to accomplish is "marking up" their JSON responses/requests. The irony is hilarious because this is exactly the job that XML was designed for.

It's obvious that the XML ecosystem grew way out of control exogenously, but the core concept was very simple and was designed to solve this exact problem, which is what I think the JSON ecosystem currently lacks.

HATEOAS, discoverable apis, and the overengineered gunk that has today hijacked the name of REST (ironic, when much of the point of the original REST was to be not SOAP) remain pipe dreams. You can't write an API that clients that don't know your API can use, and I suspect it will take at least a hundred years until that becomes possible.

Certainly XML schema doesn't let you accomplish this. All I've ever seen it accomplish is telling you that a document doesn't conform to the schema, functionality that you can trivially achieve in JSON any number of ways (e.g. an API version field in the data).
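
The version-field approach amounts to something like this Python sketch; the field name `api_version` and the set of supported versions are assumptions made up for illustration:

```python
import json

SUPPORTED_VERSIONS = {1, 2}  # hypothetical versions this client understands


def load_message(text: str) -> dict:
    """Parse a message and refuse versions we do not know how to read."""
    message = json.loads(text)
    if not isinstance(message, dict):
        raise ValueError("expected a JSON object")
    version = message.get("api_version")
    if version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported api_version: {version!r}")
    return message


ok = load_message('{"api_version": 2, "payload": {"x": 1}}')
```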

There's no point trying to write a schema system without a use case that it can solve, and I've never seen such a thing.


> JSON-LD (JavaScript Object Notation for Linking Data) is a lightweight Linked Data format that gives your data context. It is easy for humans to read and write

with json-schema, the instance objects would have a $schema property referring to the schema of the document.

the schema has a links section, where you can define all the various related paths to use.

a 'suitably intelligent validator' should be able to resolve the links for you.

[1] http://tools.ietf.org/html/draft-zyp-json-schema-03#section-...

Could someone who knows a lot about these things tell me why JSON took such a long time to arrive?

JSON, at its core, is essentially a hierarchy of maps and lists -- which seems a very intuitive and useful way to store data.

XML on the other hand has always baffled me with its attributes and the redundant and verbose tags (why do I need <tag attr="data">data</tag>?). I'm sure there was a good reason at the time for this, so perhaps someone can enlighten me.

What took the longest time was for a language to come out with key-value maps as the main core data structure, and a specialized literal syntax. Once that happened, it was relatively quick for that syntax to become a standardized interchange format for K-V data.

Lisp had assoc-lists, but those were a convention, not a specialized structure. Many languages had K-V maps as libraries, but not core structures, and most lacked literal syntax. Eventually most scripting languages started getting them as native, and even having literal syntax, but they weren't the "go-to" data structure for doing things. In Python, for example, all of its objects are really just hash maps, but when you're working with them you pretend that they're objects and not hash maps, and you use lists more than maps anyways.

JavaScript (and maybe Lua) was the first language to build itself around K-V maps, so it was the first language where idiomatic usage included a lot of map literals. Like Python, its objects were all really just maps, but unlike Python it encouraged taking advantage of that fact. Also, because it was on the web, there was a lot of need to be serializing data structures and passing them around. Eventually someone realized "this is much better than XML!" and gave it a name, and that's how we got where we are today.

XML's popularity is an accident of history, due in part to the rise of HTML, which is also an accident of history.

> Lisp had assoc-lists, but those were a convention, not a specialized structure.

I'm not sure what you mean by this. What's the difference, syntactically, between a convention and a specialized structure?

XML is Lisp. In fact, the XML grammar and Lisp's grammar are (almost) homomorphic[1]. SXML is a trivial mapping of XML to s-expressions which demonstrates this.

There's no point in comparing XML and S-expressions like that; they're essentially the same thing!

If you're talking about internal representation, well, that's up to the compiler. But since you have to declare the format either explicitly or by context, there's no 'advantage' of XML over s-expressions.

[1] To be pedantic, XML is homomorphic to SXML, which is a subset of the Lisp grammar, but that just means that Lisp recognizes some strings that aren't in the XML grammar, so if anything, Lisp is more powerful, but that's beside the point.
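
The mapping is mechanical enough to sketch in a few lines. This is a simplified, SXML-like conversion written in Python rather than Scheme, with nested lists standing in for s-expressions; it ignores namespaces, comments and mixed-content tails, so treat it as an illustration rather than real SXML:

```python
import xml.etree.ElementTree as ET


def to_sexp(elem):
    """Turn an element into [tag, ["@", [name, value], ...], children...]."""
    node = [elem.tag]
    if elem.attrib:
        node.append(["@"] + [[k, v] for k, v in elem.attrib.items()])
    if elem.text and elem.text.strip():
        node.append(elem.text.strip())
    node.extend(to_sexp(child) for child in elem)
    return node


tree = to_sexp(ET.fromstring('<person id="7"><name>Bob</name><age>50</age></person>'))
```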

> I'm not sure what you mean by this. What's the difference, syntactically, between a convention and a specialized structure?

The big difference is how people look at it, not what it really is. JavaScript has a built-in key-value map data structure. An assoc-list isn't a built-in data structure, it's a way of using a more primitive data structure (lists). In particular, assoc-lists don't really look different than normal lists, so it's a slightly larger mental leap to think in terms of them. Furthermore, Lisp doesn't use assoc-lists as often as JavaScript uses key-value maps, preferring flat lists instead, so even if there were a specialized reader-macro for assoc-lists it wouldn't have been as ubiquitous.

I agree that XML and Lisp grammars are basically interchangeable. My comment was answering a question about the emergence of JSON, and my comments about assoc-lists were only in relation to JSON, not XML.

Lisp also had plists, which practically look like Json:

For instance (:name "Bob" :age 50) being a plist called person would give (getf person :age) as 50.

That seems pretty similar to {"name": "Bob", "age": 50} and person.age to retrieve the value 50.
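
The resemblance is close enough that the conversion is one line. A Python sketch, with a flat alternating key/value list standing in for the Lisp plist:

```python
import json

plist = ["name", "Bob", "age", 50]           # stand-in for (:name "Bob" :age 50)

person = dict(zip(plist[::2], plist[1::2]))  # pair up keys and values
age = person["age"]                          # plays the role of (getf person :age)
as_json = json.dumps(person)
```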

Honestly, JSON isn't really much of an improvement over a technology that's been around since 1958, i.e. s-expressions. JSON is just the flavor of the month - I know people who dislike it because it loses some of the power of XML (XSLT, attributes, etc.)

In the end, I think it's just subjective. All of the above formats are equally capable of representing the same data.

I don't know a lot, but:

XML looked like HTML at a time when the web was the "next big thing". Like, not a technology on the web, or social media, HTML itself was this big revelation. So a solution that looks like HTML has a leg up.

Then, once there were mature parsers in a lot of languages, server software configured by it, etc, XML had some inertia that takes time to displace.

Formats very similar to JSON have been invented many times, for instance NeXT/Apple used to have a human readable format for their plist files that was basically JSON with different characters. http://code.google.com/p/networkpx/wiki/PlistSpec

For some reason they stopped using text plists and replaced it with an ugly xml serialization, but you still see these in Xcode's debugger if you log an NSArray or NSDictionary to the console.

I've actually always liked the idea of XML at its core: attributes and so on often make data structures easier to understand (just look at HTML), but namespacing and all that junk ruined the whole thing.

The only reason I still use XML every now and then is XPath. There are third-party alternatives for JSON, but XPath is ubiquitous.
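
Even the limited XPath subset in Python's standard-library ElementTree shows the appeal. A sketch over a made-up document (full XPath 1.0 needs a third-party library, which is not used here):

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this example.
doc = ET.fromstring("""
<library>
  <book lang="en"><title>Dune</title></book>
  <book lang="fr"><title>Candide</title></book>
  <book lang="en"><title>Emma</title></book>
</library>
""")

# One expression selects the titles of all English-language books.
titles = [t.text for t in doc.findall("./book[@lang='en']/title")]
```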

JSON conceptually has been around forever.

Lisp s-exps date to the original McCarthy paper in 1958, and could represent pretty much everything you can do with JSON, key-value pairs, lists, nested structure, etc.

From a big data perspective, I'm pretty sure people were making do with CSV files before JSON came along. I think most practitioners would not subject themselves to stupid, stupid XML unless they really had to.

Well-written XML that was designed for humans instead of machines is much, much easier to read than JSON. The primary reason is that, unlike s-expressions or XML, JSON has no block name. In JSON you lose valuable time figuring out the block context in a hierarchy since this isn't labelled.

The only kind of JSON that is readable is flat JSON that is nested to a maximum of 1 level.

Either format can be pretty printed. If you need signposts to figure out where you are, it is a simple matter to add dictionary key names to things.
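
Both suggestions are cheap in practice. A Python sketch: pretty-print with the standard `json` module, and hang each nested block off a named key when you want signposts (the data is invented):

```python
import json

# Hypothetical data: every nested block hangs off a named key.
order = {"order": {"customer": {"name": "Bob"}, "items": [{"sku": "A1", "qty": 2}]}}

pretty = json.dumps(order, indent=2)   # one key or value per line, indented by depth
lines = pretty.splitlines()
```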

I don't foresee JSON ever replacing XML as a "full-blown successor" in the contexts where XML actually is useful: for marking up documents.

As a general data storage format, XML is certainly going away.

It might for non-text document structures, perhaps.

JSON is not a silver-bullet. Actually I think JSON-only APIs suck -- an API should have an equivalent XML alternative as well. Let me explain.

Web APIs are not only consumed by client-side Javascript-based AJAX apps -- they are also used by server-side (web)apps where Javascript is much less widespread. If the primary application language is not Javascript for which JSON is a native format, but PHP or Java for example, then its value is much lower.

There are established industries such as publishing that use complex XML workflows -- I don't think JSON will push them out.

The XML family so far has much better standard specifications and tool support. Some of the most useful are XPath and XSLT. There are also advanced features -- too complex for some, useful for others -- like namespaces and schemas. If JSON is to expand its use, it will have to go through the same interoperability issues XML addressed, and develop similar features with similar problems. That's why the idea of JSON schemas sounds funny to me.

Let me give an example. I've developed a semantic tool that lets me import 3rd party API data as RDF. If it is available in XML, I can apply a GRDDL (basically XSLT) transformation to get RDF/XML -- and boom, it's there. RDF/XML serves as the bridge format between XML and RDF.

Now if the data is JSON-only, what do I do? I could download an API client, try to write some Java or PHP code, but that would be much less generic and extensible than XSLT. I could probably try a pivotal conversion via JSON-LD somehow, but oh, bummer -- there's no JSON transformation language? Or is there... Javascript? Thanks, I would prefer XSLT any day since it is designed specifically for this kind of task.

My point is, by offering JSON-only you cut off all the useful tools from the XML world, which is pretty well established. I see JSON as an alternative syntax to XML, which is easier to use with Javascript -- but by no means THE "right tool" to all data serialization problems.

One of my biggest issues with JSON is it's a lot harder to generate valid JSON as a stream. Granted this may be an esoteric use case, but the quoting rules and type representations seem to require some amount of look ahead which isn't fun when generating that stream.
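
For the common case of a top-level array, one workaround is to serialize each item on its own and emit the brackets and commas by hand. A Python sketch; `stream_array` is a hypothetical helper, and it only handles a flat stream of items, not arbitrary nested structures produced incrementally:

```python
import json
from typing import Any, Iterable, Iterator


def stream_array(items: Iterable[Any]) -> Iterator[str]:
    """Yield chunks that concatenate to one valid JSON array."""
    yield "["
    first = True
    for item in items:
        if not first:
            yield ","            # separator goes before every item but the first
        yield json.dumps(item)   # quoting and escaping handled per item
        first = False
    yield "]"


chunks = list(stream_array(x * x for x in range(3)))
document = "".join(chunks)
```

Writing the separator before each item rather than after it avoids needing to know in advance which item is the last one.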

For human readable stuff, I don't know why we don't use YAML more often. The serializer is utterly fantastic, though I don't think JavaScript supports parsing it quickly.

Poorly implemented parsers, especially in Javascript.

I think JSON is more popular than XML for a lot of things simply because it's so much simpler to interact with. No querying attributes, elements, elements inside elements, text inside elements, etc. You just look up the value attached to a key, or look up an index in an array, and that's it. It's simple every level down. And it's also simple to construct.

> [XML] enabled people to do previously unthinkable things, like exchange Microsoft Office documents across HTTP connections.


JSON is data. XML is markup. JSON won.

Move on.
