STON – Smalltalk Object Notation (2012) (github.com)
104 points by mpweiher on Mar 15, 2017 | 59 comments



While many applaud this notation as superior to JSON, let's face the criticism:

It's slower than JSON, less readable than JSON and YAML, and less secure than JSON.

To get this feature set, use YAML. YAML already supports classes and references, and is more readable than STON.

Slower: Having to support references, STON needs to store every object in a hash. JSON doesn't need to. JSON is at least 10x faster.

Readable: JSON and YAML have much less syntax, and are more pleasant to the eye.

Insecure: Changing classes in serialization protocols without any protection is the most common exploit vector, especially when supporting user classes, not only builtins. YAML at least supports builtin classes only via tags.


> Slower: ...every object in a hash...10x faster.

Did you measure the 10x faster?

Anyway, the dictionary is only needed on serialization, and then you only need an identity dictionary, which tends to be quite fast as it is only hashing the pointers. Deserialization only needs an array, with negligible overhead.
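The serialization side can be sketched in a few lines. This is an illustrative Python sketch, not STON's actual implementation: a dictionary keyed on object identity (`id()` is just the pointer) records each container, so a shared object becomes a cheap back-reference instead of being walked twice.

```python
def serialize(obj, seen=None):
    """Encode obj into plain data, replacing repeat visits with back-references."""
    seen = {} if seen is None else seen
    if isinstance(obj, (dict, list)):
        oid = id(obj)                      # identity hash: just the pointer
        if oid in seen:
            return {"@ref": seen[oid]}     # emit a back-reference marker
        seen[oid] = len(seen) + 1          # number objects in visit order
        if isinstance(obj, dict):
            return {k: serialize(v, seen) for k, v in obj.items()}
        return [serialize(v, seen) for v in obj]
    return obj                             # scalars pass through unchanged

shared = {"a": 1}
top = {"x": shared, "y": shared}
print(serialize(top))                      # second occurrence becomes {"@ref": 2}
```

Deserialization would only need the numbered array in reverse, which is the negligible-overhead side mentioned above.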

Furthermore, with JSON you tend to decode/encode via temporary dictionary representations anyway, which is many times more expensive when you're starting from an object representation. Going to an object representation also often materializes a full array/dictionary representation, because that's what the JSON specifies. You only figure out objects afterward.

So in my estimation, this has the potential of being significantly faster than typical JSON en-/de-coders.
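The two-step cost is easy to see in Python; `Point` here is a made-up example class, not anything from the thread:

```python
import json

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# A typical JSON decoder materializes a generic dict first...
d = json.loads('{"x": 1, "y": 2}')
# ...and only afterwards is the real object built from it, paying for a
# throwaway intermediate representation. A tag-first format like STON can
# dispatch to the right class while still parsing.
p = Point(**d)
print(p.x, p.y)
```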

[Background: I worked on XML parsing for a while, and having the tag precede the contents was invaluable in creating efficient parsers that do not require a generic intermediate representation. Such generic intermediate representations typically eat around 10x performance. See https://github.com/mpw/Objective-XML/tree/master

Although the first time I noticed the effects of generic intermediate representations was parsing Display Postscript's Binary Object Sequences, generated from a PS->PDF distiller written in Postscript, into graphical objects]


Yes, otherwise I wouldn't come up with such numbers. I maintain several such serializers with various feature sets.

Yes, only serialization would be slow.

Yes, going directly to objects would be faster also. YAML or native serializers do that.

But your estimation is flawed. Fastest are still special pre-compiled serializers, such as protobuf, or better variants thereof which can mmap native objects directly.

Then binary serializers such as MessagePack, or the insecure BSON, because of compactness.

Then JSON, because of simplicity and limited recursion.

Then the dynamic monsters, which support references and objects.


I would disagree that JSON is more readable than STON, if only for the fact that all keys have to be quoted in JSON, but I agree with the rest of your comment.


I was referring to the # pound syntax for fields. Fields could be prefixed with a . or - much more pleasingly to the eye. An untrained eye would take the # for comments.

    TestDomainObject {
      #created : DateAndTime [ '2012-02-14T16:40:15+01:00' ],
      #modified : DateAndTime [ '2012-02-14T16:40:18+01:00' ],
      #integer : 39581,
      #float : 73.84789359463944,
      #description : 'This is a test',
      #color : #green,
      #tags : [
        #two,
        #beta,
        #medium
      ],
      #bytes : ByteArray [ 'afabfdf61d030f43eb67960c0ae9f39f' ],
      #boolean : false
    }


Your untrained eye took the # for comments, but I'm not sure that generalized to anyone's. Plenty of languages use # for other things: Smalltalk uses them for symbols, Clojure uses them for reader macros, OCaml uses them for method calls, Haskell can use them as an arbitrary operator, Lua uses them for length, C and C++ use them for compiler directives. See https://en.wikipedia.org/wiki/Number_sign#In_computing for a list.

Admittedly, Python, Bash, Perl, and Ruby use # for comments, so you do have a point, especially since Python is such a common teaching language.


No, the point is the clutter. A pound sign # is the most intrusive ASCII character, at about 85% black. Older languages were criticized for using $ (about 55% black) as a prefix. It's like writing in ALL-CAPS.


This is a very valid critique in terms of graphic design.

(For anyone confused by the mention of graphic design: it's all about communicating information through visuals.)


Oh, I thought they were hashtags, I was gonna see if they were trending on twitter.


"An untrained eye would take the # for comments"

\begin{pedantry}

I'd argue that an untrained eye would take the # for ordinary syntax noise, since such an eye would have no notion of a "comment". A trained eye might be confused by that, but only if it was trained against an environment that uses # for comments.

\end{pedantry}


Since this looks like another human readable encoding shootout, I humbly offer luxem [0] which allows unquoted keys. Simpler than JSON, much simpler than JSON5, and probably as powerful as STON.

[0] https://github.com/rendaw/luxem


All these text-based serialization[1] formats are IMO relying on the same fallacies used to sell XML 15 years ago.

Would you rather use XML for serialization, or an ad-hoc, undocumented binary format?

Er, XML I guess.

So it's proven: XML is better than binary formats! The golden age of XML is upon us!

[1] The specific task of serialization over a network; for configuration files or other applications involving direct manipulation by a human or a shell script, JSON or YAML (or XML) can be reasonable.


> [1] The specific task of serialization over a network; for configuration files JSON or YAML (or XML) can be reasonable.

I would argue that JSON is not reasonable for configuration files that are meant to be edited by humans, because it doesn't support comments. It's also annoying to write JSON by hand because all keys have to be quoted.

YAML on the other hand, is a very nice configuration format.


Also, the lack of trailing commas often means editing two lines to add a new element to a list or object. I've actually experienced multiple outages at different companies where someone helpfully added a trailing comma to a JSON config.
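Python's strict JSON parser shows the failure mode: one helpful extra comma and the whole config fails to load.

```python
import json

# Strict JSON rejects a trailing comma outright -- a hand-edited config
# with one extra comma does not load at all:
try:
    json.loads('{"servers": ["a", "b",]}')
except json.JSONDecodeError as e:
    print("config rejected:", e)
```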


Exactly!

While it's nice to have easy-to-read serialization formats when debugging, this strikes me as an attempt to come up with one format for both configuration and serialization. But the two have very different design objectives.

Once we get over our fear of binary formats, there are some simple and generic ways to do serialization that can be used in any language, like Erlang's OSC[1], discussed on HN last year[2].

[1] http://joearms.github.io/2016/01/28/A-Badass-Way-To-Connect-...

[2] https://news.ycombinator.com/item?id=10976737


> Erlang's OSC

Uh? This is just some guy who used OSC in Erlang. OSC is used in many, many, many languages, mostly in computer music and interactive arts: https://en.wikipedia.org/wiki/Open_Sound_Control

And guess what, it's honestly a pain as soon as you have a big application.

Source: I'm the developer of an OSC sequencer (http://www.i-score.org), and a common complaint from artists with between 1000 and 20000 parameters in their OSC namespace is that it's absolutely unmanageable; a lot of people are trying to add order, type systems, you name it, on top of OSC to bring some safety and documentability to the protocol.


Meh. XML is a subset of what you can do with s-expressions, with higher scanning and parsing overhead. Actually now that I say that it's pretty much true for any serialization format.


This looks superficially similar to (a related subset of) QML code.

    Rectangle {
        id: photo                                  // id on the first line makes it easy to find an object

        property bool thumbnail: false             // property declarations
        property alias image: photoImage.source

        signal clicked                             // signal declarations

        function doSomething(x)                    // javascript functions
        {
            return x + photoImage.width
        }

        color: "gray"                              // object properties
        x: 20; y: 20; height: 150                  // try to group related properties together
        width: {                                   // large bindings
            if(photoImage.width > 200){
                photoImage.width;
            }else{
                200;
            }
        }

        ....
    }


Interesting. There are alternatives to consider, though.

If you want this feature set, YAML is a reasonable alternative.

If you just want to add support for comments, trailing commas, and a few other things, JSON5 is an alternative: http://json5.org/

If you're processing lots of S-expressions (e.g., Lisp code or data), the readable Lisp notations (including sweet-expressions) might be of use: http://readable.sourceforge.net/


EDN does all this, and loses the commas too (winning!).


I prefer LISP notation, because it naturally allows storing functions.


While I love to sing love songs to Lisp, I see Lisp's notation as naturally allowing the storage of lists; storing functions/macros/structs etc. is a matter of interpretation at a higher level of abstraction, where the language semantics live, not in the notation itself.


Of course, but try to do it e.g. in JSON, and you'll see that the resulting representation quickly becomes convoluted. In contrast, the LISP notation is basically the same notation as you'd use in the language LISP itself. That's what I mean by "natural".


I'm not clear on its advantages for the serialization of Smalltalk objects. Would the deserializer be written in a Lisp?


You can treat the serialized S-expressions literally as Lisp code. The first token of the list could be the name of a macro, for example, that could expand into any kind of code you want to execute.

(So yeah, you better be really, really sure you control the data you are processing this way.)


I made a basic LISP-like JSON-like thing: http://loonfile.info


You might want to check out https://github.com/edn-format/edn


Transit tries to solve the same problem with JSON https://github.com/cognitect/transit-format


STON is more similar to EDN than to Transit.


I myself use Rebol/Red which I find much easier to use.

Here's a translation of the first example:

    test-domain-object: make object! [
        created:  2012-02-14/16:40:15+01:00
        modified: 2012-02-14/16:40:18+01:00
        integer:  39581
        float:    73.84789359463944
        description: "This is a test"
        color: green
        tags: [
            #two
            #beta
            #medium
        ]
        bytes:   #{afabfdf61d030f43eb67960c0ae9f39f}
        boolean: false
    ]
And there is REN (REadable Notation) which is an attempt to produce a (sub-set) standard so it can be interchanged with other languages - http://pointillistic.com/ren/ | https://github.com/humanistic/REN


It's like JSON, but cool. ;-)


Interop between different languages still requires specific implementations written in them, so replacing type annotations with, say, a property called "$t" sounds like a good tradeoff regardless. (and a rather cosmetic change, unless I've missed something obvious here)

I've been happy with this approach when working with a stack built with JS and C# - it might not suit everyone, ofc.

[edit] typo


Why a downvote, sir?


Looks like an improvement over JSON.


No, of course not. It makes this kind of JSON insecure and slow.

You can use YAML for this kind of stuff already, and it's still more readable than STON.


I am trying to set up a deploy script using Ansible just now, which uses YAML. It's more readable in one sense, but indentation gets a bit confusing when you are not especially familiar with it. JSON is more obvious in that respect.

I wish we could have Python dictionaries as an alternative to JSON, as large JSON configuration files are horrible when you add an extra trailing comma or things get too nested.


I really like TOML: https://github.com/toml-lang/toml


Why is STON less secure than JSON?


Because you can write arbitrary classes in the output, which can then be called (since dispatch is dynamic, and the output object is presumably either untyped or typed as the equivalent of `Object`). If you know which classes are available, you can arbitrarily choose one as an attacker. JSON is data only, no functionality.


That doesn't make the format insecure, it means it is possible to write an insecure deserializer (which is possible for any format); it's also possible to write a secure deserializer, or one that can be configured based on use-case to balance exposed functionality vs. security appropriately for the use case.

One-size-fits-all security-through-lack-of-functionality is not necessarily the ideal approach for all data transfer applications.


But you need constant vigilance to keep the deserialiser secure. In JSON, any change to the deserialiser requires checking for security holes and DoS attacks. With this, you also need to check every method in every class in the set of possible deserialisation targets. If you don't have any methods and it's all just pure data (which would ensure equal security to JSON), then why are you using this instead of JSON?

It is _absolutely_ security through lack of functionality. You simply grant fewer powers to untrusted inputs, as well you should. If you were designing a programming language intended to be written by hostile actors, you would not include primitives for opening files, running applications, or spinning infinite loops, since they are all security issues. That's security through restriction of functionality. JSON is trivially secure; this is not. You need to maintain its security.
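Python's own serializers make the contrast concrete: pickle is the "deserializer with functionality", json the data-only one. The `Evil` class below is a deliberately contrived illustration, using `print` as a stand-in for arbitrary code.

```python
import json
import pickle

class Evil:
    def __reduce__(self):
        # On unpickling, the deserializer calls whatever callable the
        # payload names -- print here, but it could be anything.
        return (print, ("arbitrary code ran during deserialization",))

pickle.loads(pickle.dumps(Evil()))   # runs print() as a side effect of loading

# json, by contrast, can only ever yield data -- no class, no call:
print(json.loads('{"just": "data"}'))
```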


security-through-lack-of-functionality <- Excellent point! One should decide where the cost of security should be paid.


I wrote this: http://search.cpan.org/~rurban/Cpanel-JSON-XS/XS.pm#SECURITY...

One of the relevant object serializer exploits was CVE-2015-1592


why?


From the Rationale paragraph:

However, JSON knows only about lists and maps. There is no concept of object types or classes. This means that it is not easy to encode arbitrary objects, and some of the possible solutions are quite verbose (Encoding the type or class as a property and/or adding an indirection to encode the object's contents).

Adding a symbol (globally unique string) primitive type is a very useful addition: because symbols help to represent constant values in a compact and fast yet readable way, and because symbols allow simpler and more readable map keys.


As a clojure guy, I use EDN for the same reasons that STON exists (but of course EDN plays beautifully with clojure)


This is very true! What a shame the no doubt very clever people who specified ES6 failed to read it, because if they had, I can't imagine that ES6 "symbols" would be so bizarrely broken.


> I can't imagine that ES6 "symbols" would be so bizarrely broken.

In what way are ES6 “symbols” broken? (genuinely interested, I haven’t really used symbols before)


They're not like what are called symbols in any other language.

An ES6 symbol is a unique object with an optional name. They can be used as property names, like strings, which means that code can add new properties to objects without worrying about name collisions. Two symbols are distinct objects regardless of name: Symbol("foo") is never the same symbol as Symbol("foo").

Symbols as used in other languages (sometimes also called atoms) are effectively unique instances of strings. They chiefly serve the opposite purpose of ES6 symbols, which is being able to refer to the same object by name in different parts of code, even across executions of the program, or across distributed systems. i.e. :foo is always the same symbol as :foo.

They're (often) more efficient than simple strings because :foo and :foo will always reference the same object in a given instance of the program, so they can be compared by address rather than character-by-character.

Some languages with symbols have a function, often called gensym, for generating a symbol with a unique name, for when you need to support ES6's use case. e.g. Lisp and Prolog.
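Python has no symbol type, but string interning approximates the traditional semantics, and bare objects approximate the ES6/gensym ones; this is only an analogy, not either language's actual machinery:

```python
import sys

# Traditional symbols: same name -> same object, so comparison is by
# address, not character-by-character. sys.intern gives that guarantee:
a = sys.intern("foo")
b = sys.intern("foo")
print(a is b)        # True: one shared instance per name

# An ES6-style Symbol is the opposite, closer to Lisp's gensym: a fresh
# identity every time, even with the same description:
s1, s2 = object(), object()
print(s1 is s2)      # False: distinct identities regardless of any name
```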


Given ES2015's intended goal to remove a lot of the global interconnected nature of the language (let/const over classic var; module scopes), it seems clear why it "over-corrected" and only supports gensym-style unique symbols, exported at the module level if they need to be reused. (Whence well-known symbols like Symbol.iterator.)

I've seen strawman proposals for ES to also support some form of a global symbol namespace, but after debugging much of the legacy of JS global-happy code and order-of-script-tags bugs I, for one, am happy that none of those strawman proposals are currently favored by the committee.


There are global symbols. You use Symbol.for() to either create or retrieve them; e.g. Symbol.for('hello') is available globally through the global registry. Note that an independently created symbol, e.g. Symbol('hello'), is not the same as Symbol.for('hello'), while Symbol.for('hello') === Symbol.for('hello').
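The registry semantics are simple enough to sketch in a few lines of Python; this is illustrative only, not the actual ES6 machinery:

```python
# A minimal Symbol.for-style global registry: first call for a name
# creates the symbol, later calls return the same one.
_registry = {}

def symbol_for(name):
    return _registry.setdefault(name, object())

print(symbol_for("hello") is symbol_for("hello"))  # registry lookups agree
print(object() is symbol_for("hello"))             # fresh identities do not
```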


Thanks, I had forgotten that had made it in after all.


I'd just stick with JSON. I think the reason why JSON is so popular is that it works with most languages. This would only be compatible with object oriented languages. Adding in code blocks would only make it easily compatible with one.


> This would only be compatible with object oriented languages.

Why do you think that?

It's compatible with any language with named record types, which is likely to be about any language which can parse JSON in the first place. It's particularly compatible with Erlang, which has named record types and (the equivalent of) symbols, and Erlang isn't generally considered object-oriented.


> This would only be compatible with object oriented languages.

Even then, it's only compatible with an implementation that either implements or stubs all the classes that a given STON body references. I can see where this makes sense for interop among Smalltalk environments with enormously complete standard libraries, where everything likely to be referenced in an arbitrary STON blob is available, but for interop among implementations in multiple languages, it's a no-go unless either:

- STON use is limited to a well-defined subset implemented by all parties in the interaction, or

- all languages use STON parsers which support automatically stubbing (ignoring, etc.) classes specified in content but which aren't available in the parsing context, or

- all parties in the interchange provide custom-implemented and probably dangerously incomplete translation layers between STON and the rest of the implementation.

That said, there are a couple of things here that I really like. I was about to say that, since es6 has a native symbol type, it might make a lot of sense to include symbol literals in a new version of JSON - but I've just taken the time to actually examine es6's symbol implementation for the first time in detail, and...well, I'm not sure precisely what it is, but I'm quite certain it's not what it claims to be, and its justification for even existing is gravely in doubt. That's pretty special, and it also means that STON symbols wouldn't even make sense in the context of es6, so never mind. The internal references in STON are pretty neat, too, but it'd be a hairball for any parser to implement and pretty useless unless every parser implements it, so never mind that too.

Oh well. If I weren't accustomed to compromise, I wouldn't spend so much time writing Javascript, would I? I will say, though, this "symbol" business really depresses whatever enthusiasm I had for deep-diving the post-ES5 variations of JS. If they're so far off the mark with something as trivially simple as symbols, God alone knows how badly they'll have handled the parts that are at all complex.


> only compatible with an implementation that either implements or stubs all the classes that a given STON body references.

Don't see why that would be the case. You can easily ignore the class info and then interact based on arrays/dictionaries. You just don't get the benefits.


I suppose I'd call that an extreme case of stubbing; if I could still edit, I'd replace that with "implements, stubs, or ignores".


"Extreme" in that it ignores everything, but very simple to implement: parse the name and toss.
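A toy sketch of "parse the name and toss" in Python. The regex and the JSON-compatible input are simplifying assumptions: real STON also has #symbols and single-quoted strings, which this deliberately ignores.

```python
import json
import re

def parse_ignoring_classes(text):
    # Strip a capitalized class name appearing before a { or [ and read
    # the remainder as plain maps/lists, discarding all type info.
    stripped = re.sub(r'\b[A-Z]\w*\s*(?=[{\[])', '', text)
    return json.loads(stripped)

print(parse_ignoring_classes('Point { "x": 1, "y": 2 }'))
```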


This is not a replacement for JSON.

For one, it's supposed to be able to serialize class instances too.



