Useful old technologies: ASN.1 (2013) (ttsiodras.github.io)
86 points by LaSombra on Jan 11, 2015 | 29 comments

ASN.1 is a bit unfortunate. On the one hand, I want to believe: it's a fairly ubiquitous standard (at least, as long as you're dealing with cryptography or a handful of other specific fields), and it's true that it can be a lot more compact than JSON or XML (see below), and it's even occasionally faster. However:

1. A lot of ASN.1 software is pretty buggy and undermaintained. How do I submit a bug to pyasn1? Because it's entrenched in libraries three or four layers below what most developers will see, it's also difficult to replace.

2. Faster and more compact depends on the encoding rules. Let's talk about the dozen or so different ways you have of encoding ASN.1. BER? DER? Maybe CER? XER? Why not both: CXER! I'm missing a bunch, I know.

3. The performance and network arguments ignore what you can do with compressed verbose formats like JSON, XML or EDN. lz4 is magic. Even totally ubiquitous gz compression works extraordinarily well.

4. While ASN.1 messages do contain descriptions of what types they contain (e.g., you can see that there's a bit string coming), they aren't self-describing in the same way that JSON or XML are, which is quite annoying for debugging. You can make self-describing messages using XML, but at that point you're literally just doing XML. Good luck finding software that lets you easily use whichever you like -- even if you subscribe to the notion that easy debuggability doesn't matter for production messages, a notion I strongly disagree with.

5. I'd harp about its extensibility, but it's really no better than JSON, so I won't.

All-in-all, I'm compressing some EDN, and I'm pretty happy.

I feel the same way about ASN.1. I remember implementing some ASN.1 tools as part of an SNMP engine and thinking "how could anyone possibly implement all of this correctly?" -- a few years later, all those vulnerabilities in SNMP products at the ASN.1 encoding level came to light. On the other hand, it's sad to see efficient on-the-wire encoding ignored, and the ASN.1 standards aren't really abominations when you compare them with some of the web services standards.

In addition to what you mentioned, I wanted to mention EXI (and FAST, the FIX encoding) as interesting on-the-wire encodings that maybe more people should consider over just compressed JSON or XML. Generic LZ-based compression doesn't necessarily win very much with lots of short messages.

I had a similar feeling making an SNMP MIB processor around the same time (2000?) -- the ASN.1 libraries were so complicated. So I wrote my own, which turned out simple and symmetrical just by encoding from back to front instead of the messy front-to-back-and-backpatch everyone else was doing. (And I guess by leaving some things out that we didn't use, like other encoding rules? I don't remember.)

When those vulnerabilities hit I never found out how that code did -- I wish they'd open-sourced it as they'd planned.

So, that particular part (DER? again I forget) seemed tolerable to me. The newer stuff like Cap'n Proto is probably still better.

Yes, this! This is the secret to simple BER/DER encoding: back-to-front. I was almost converted to ASN.1 BER after discovering this, but in the intervening 8 years the spell has (thankfully) worn off and I can see it for the clattering technological jalopy that it is.
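For readers who have not seen the back-to-front trick, here is a minimal Python sketch of it. The names `BackwardWriter`, `write_tlv` and `write_header` are made up for illustration and do not come from any library mentioned in this thread. The only point is that when the contents are written before the header, every length is already known and nothing needs backpatching.

```python
def encode_length(n: int) -> bytes:
    """DER definite-length octets: short form below 128, long form otherwise."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


class BackwardWriter:
    """Hypothetical helper: grows the output from its last byte toward its first."""

    def __init__(self) -> None:
        self._rev = bytearray()  # output bytes, stored in reverse order

    def prepend(self, data: bytes) -> None:
        self._rev.extend(reversed(data))

    def size(self) -> int:
        return len(self._rev)

    def getvalue(self) -> bytes:
        return bytes(reversed(self._rev))


def write_tlv(w: BackwardWriter, tag: int, value: bytes) -> None:
    """Write a primitive TLV: value first, then length, then tag."""
    w.prepend(value)
    w.prepend(encode_length(len(value)))
    w.prepend(bytes([tag]))


def write_header(w: BackwardWriter, tag: int, start: int) -> None:
    """Wrap everything written since `start` in a constructed TLV."""
    w.prepend(encode_length(w.size() - start))
    w.prepend(bytes([tag]))


def encode_example() -> bytes:
    """SEQUENCE { INTEGER 5, OCTET STRING "hi" }, written last element first."""
    w = BackwardWriter()
    start = w.size()
    write_tlv(w, 0x04, b"hi")     # OCTET STRING, the last member
    write_tlv(w, 0x02, b"\x05")   # INTEGER 5, the first member
    write_header(w, 0x30, start)  # SEQUENCE header, length known by now
    return w.getvalue()
```

`encode_example().hex()` gives `300702010504026869`. A front-to-back encoder would have to either compute the SEQUENCE length in a separate pass or reserve space for it and patch it later.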

The features in the author's F# ASN.1 compiler are pretty swank. ASN.1 probably gets a bad rap because of BER/DER.

FAST isn't exactly a FIX encoding, it's a key-value encoding that can be translated to/from FIX. Roughly speaking, FAST is to FIX as BSON is to JSON.

But starting to use FAST today seems like a bad idea, because the largest publicly-known production users of FAST have already moved away from it (toward plain, uncompressed binary structs on the wire).

I worked for a startup doing a network management product for multiple third-party devices, mostly talking to them over SNMP, and their SNMP agents would fail in all sorts of amusing ways. Getting stuck in loops instead of sending all the OIDs you're looking for was probably the most common case, and there were a couple that got tripped up on nulls in their data, but there were others that were more exotic, so we ended up maintaining three different libraries to talk to them (plus our trap receiver). Of course you'd find problems in the underlying libraries from time to time as well, like when there was a memory leak, but only under certain high-latency conditions.

Then there were things like the Meru devices that added or deleted a field and accidentally renumbered all the following entities in the (clearly auto-generated) MIB, on a minor firmware version update.

> Faster and more compact depends on the encoding rules.

Sure; but in reality, you just write your from-scratch systems to use DER, and then throw in a "protocol-upgrade to PER" message/bitflag if things aren't going fast enough. And then if some crazy legacy SOAP+WSDL enterprise wants to integrate their codebase with yours, you add the ability to negotiate XER just for when they're speaking to you. It's not like all the different encodings are equally valid choices when both nodes are under your control; they fall pretty clearly on a spectrum from "whenever you can get away with it" down to "only if you have to."

> The performance and network arguments ignore what you can do with compressed verbose formats...

Wire compression doesn't help if you're sending short intermittent messages across millions of connections, rather than batch-streaming messages across a link. Specifically in the GSM context mentioned in the article, the average control-channel ASN.1 message won't be helped at all by wire compression.
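The short-message point is easy to check with Python's standard `zlib` module. This is only a rough sketch: the sample message is invented for illustration and is not a real GSM control-channel message.

```python
import zlib

# Invented sample payload, standing in for one short, intermittent message.
message = b'{"id":7,"ok":true}'

single = zlib.compress(message)        # one message per connection
batch = zlib.compress(message * 1000)  # the same message batch-streamed

print("one message: ", len(message), "->", len(single), "bytes")
print("1000 batched:", len(message) * 1000, "->", len(batch), "bytes")
```

A lone message comes out larger than it went in, because the zlib header and checksum cost more than the compressor can save on so little data. The batch shrinks dramatically, which is the case the "just compress JSON" argument quietly assumes.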

Also, this isn't an argument in favor of ASN.1, but sometimes you need your wire message format to be efficient not because of throughput constraints, but because you need to directly manipulate the resulting data in memory (ala Cap'n Proto), and so efficiency on the wire directly translates to efficiency of packing in memory. The protocol can be compressed on the wire on top of this, of course, but it implies that "just use wire compression" won't solve every efficiency problem.

> they aren't self-describing in the same way that JSON or XML are

ASN.1 comes from an era of record-oriented storage, where a schema is transmitted once, in advance—or possibly baked into the client—and then rows defined by the schema are transferred many times. XML does this well for individual self-describing messages by allowing for a reference to a DTD+XML schema, but ASN.1, being for streams of records, expects you to create a message-container format to reference your schema, so as to remove the overhead of repeatedly mentioning it on the wire.

This last one makes a more general point for me: some protocols reward good engineering done in the code that uses them, and punish bad engineering. If you're trying to take the naive-and-suboptimal approach to something and a protocol is fighting with you, it might be because the protocol "wants" to be used in an optimal fashion.

EDN ain't half bad, true. It addresses some of the JSON shortcomings (like: lack of comments) and is still readable and easy to parse. And I have to agree with ASN.1 middleware - most of it is old and of low quality. But this can be said about pretty much everything that's been around before the ubiquitousness of the Internet (i.e. when developers assumed good faith and hacking wasn't even a topic to be discussed).

And yes, some of the X- encodings for ASN.1 should die in flames (alongside XML in general; flames are welcome ;>). It's still a pretty neat binary format if one uses BER though. But when it comes to binary formats, I really like everything IFF-like. It's similar to ASN.1 in terms of how data is structured, but uses far simpler rules for headers and payload. It's much harder to botch the implementation.

I remember back when Protobuf and Thrift were new, there were discussions about them re-inventing the ASN.1 wheel.

I'd love to hear opinions from people who have used them, and about the experiences they had. I've found an interesting thread [1] discussing them, and the claimed advantages for Protobuf are as follows:

- faster (real existing software, not some hypothetical ASN.1 compiler that could do x)

- easier to maintain backward compatibility

- much simpler, and thus easier to understand and more robust

Especially maintaining compatibility seems crucial to me in large distributed systems.

[1] https://groups.google.com/forum/#!topic/protobuf/eNAZlnPKVW4

Protobuf is vastly simpler, but has most of the power. As it's used internally by Google it's also very likely to have been audited and reviewed carefully, and thus very unlikely to have security bugs (otherwise all their internal systems would be wide open).

ASN.1 is used as one protocol in the next-gen air traffic control systems currently being deployed in Europe. The relevant aerospace standards spell out a specific subset which helps with getting it correct. Only the single Packed Encoding Rules (PER) encoding is used, for example.

ASN.1 (PER) is also used in the EU eCall system as the encoding protocol for the data between the on-board unit and the emergency call center.

Coincidentally, ASN.1 (specifically its DER encoding) is used in X.509v3, better known as TLS certificates. For a taste of the craziness that ensues check out Peter Gutmann's X.509 style guide [1].

Personal highlights:

* Many different ways of encoding a simple string, with some very obscure encodings.

* SET OFs are sorted in the DER encoding, to ensure consistent bytestreams. This sucks for embedded systems.

* OIDs (unique identifiers for things) are unbounded.

[1] https://www.cs.auckland.ac.nz/~pgut001/pubs/x509guide.txt
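To make the SET OF complaint concrete, here is a minimal Python sketch. `der_set_of` is a hypothetical helper, not an API of any library named here, and it only handles short-form lengths. DER requires the element encodings to appear in ascending order when compared as octet strings, so the encoder has to hold every encoded element before it can emit the first one.

```python
def der_set_of(encoded_elements) -> bytes:
    """Hypothetical helper: wrap already-encoded elements in a DER SET OF.

    All elements must be buffered and sorted before any output is produced.
    Plain bytes comparison is used here, on the assumption that the inputs
    are well-formed TLVs (no TLV is a proper prefix of another).
    """
    body = b"".join(sorted(encoded_elements))
    if len(body) >= 0x80:
        raise ValueError("long-form lengths are left out of this sketch")
    return bytes([0x31, len(body)]) + body  # 0x31 = SET, constructed


# Two INTEGER elements supplied in different orders give identical bytes.
a = der_set_of([b"\x02\x01\x05", b"\x02\x01\x03"])
b = der_set_of([b"\x02\x01\x03", b"\x02\x01\x05"])
```

Both `a` and `b` are `31 06 02 01 03 02 01 05`. That determinism is what signatures need, but the buffering and sorting is the part that hurts on a device with a few kilobytes of RAM.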

I implemented X.509 several times in the late 90's. Generalized ASN.1 is a mess: I don't believe there were any decent open source toolsets then, and the code emitted by the ASN.1 compilers I could test was lame and required you to adopt their conventions completely through your code.

Furthermore, ASN.1 has the usual lameness you get when people build generic description languages: for example, it's quite common to encode a particular ASN.1 structure, and then put the resulting structure into an OCTET STRING for inclusion in a parent structure (take a look at Extension in RFC 5280's ASN.1, for example). This is presumably because ASN.1 didn't (doesn't?) support an ANY type to allow inclusion of arbitrary structures that the decoder didn't know how to parse, so there's no extensibility without such tricks.

In the end, I punted and just used BER/DER directly without ever using ASN.1. This made a lot of things much simpler and produced much smaller and more efficient code (e.g. my cert parser for our SSL library for the Palm III ran with no additional allocation space, and compiled to a few K of code).

Marshalling and unmarshalling are common, yet ill-matched to most programming languages. You usually have to define the marshalled form in some cross-language notation, then use some overly complex tool set to get it to play well with the language. This applies to ASN.1, Google protocol buffers, OpenRPC, and XML. Each has its very own data definition language and tool chain.

JSON is taking over because it's a good match to languages where you can define lists and dictionaries easily. Most languages now have that. The overhead is high, but the simplicity is helpful. As a practical matter, it's usually easier to get things to talk to each other with JSON than with the more rigid forms. Someone is usually out of compliance with the data definition.

There's now "Universal binary JSON" (http://ubjson.org/). That's just a binary representation of JSON. Then there's JSON Schema, which adds a data definition language to JSON. Put both of those together, and you have something very close to ASN.1.

And the wheel goes round and round.

I felt like MessagePack was the clear JSON-as-binary winner. Oddly, ubjson does not mention it. Do you know how it compares?

I'd be curious to know if Google's Protobuf library, and kentonv's followup Cap'n Proto library, use concepts initially from ASN.1.

There is also FlatBuffers which takes Cap'n Proto ideas and brings it back to Google (http://google.github.io/flatbuffers/).

These encoding formats (or rather their implementations) are based on minimizing copying of data. Deep down they are based on mmap-ing memory areas, not unlike the casting of blobs of memory to packed structures that you see elsewhere (but with more safety).

Nice to see more applications of it. But you don't even need a fancy ASN.1 compiler; OpenLDAP's liblber will do BER/DER just fine. (It uses malloc, but that's because LDAP messages have unbounded size. And it doesn't overuse malloc...)

XML and JSON are both ridiculously inefficient, both for static storage and especially for communication protocols. Can't wait for them to die.

I dealt with ASN.1 some years ago when working on an SMS anti-spam/anti-fraud system. ASN.1 was both amazingly awesome and horribly frustrating at the same time. I have an awful lot of respect for it especially given its age, but I'm also glad to now be working somewhere where I can use JSON, EDN and Transit :)

And if you happen to be dealing at the bit level or have variable-length fields, as found in encodings such as ExpGolomb (typical in audio-visual encodings), then check out Flavor[0].

[0] http://flavor.sourceforge.net

This is orthogonal to the argument in the article, but the "buffer overflow" example in C is incorrect. Even if sizeof(b) is smaller in the receiver than in the sender, the receiver will only read at most as many bytes as it (the receiver) thinks are in b -- whatever it got for sizeof(b). Of course, this could still lead to a truncated message, but we'd all be in pretty big trouble if you could buffer overflow a server by sending it a message larger than its recv buffer :)

Maybe you don't need full ASN.1 but only TLV?


ASN.1 is an awesome way of encoding information, but with variable lengths all over, implementations are very prone to bugs and buffer overflows.

It sure was fun implementing back then.

At some point you will have to deal with items of varying length, and what ASN.1 does is IMHO far better than some of the other protocols. In particular, DER is all TLV, so the lengths are explicitly provided a priori and can be checked easily.
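A minimal Python sketch of what "checked easily" can look like for a single DER TLV. `read_tlv` is a hypothetical name, and the sketch deliberately leaves out multi-byte tags. Every length is validated against the buffer before any value bytes are touched.

```python
def read_tlv(buf: bytes, pos: int = 0):
    """Return (tag, value, next_pos) for one DER TLV, rejecting bad lengths."""
    if pos + 2 > len(buf):
        raise ValueError("truncated header")
    tag = buf[pos]
    if tag & 0x1F == 0x1F:
        raise ValueError("multi-byte tags are not handled in this sketch")
    first = buf[pos + 1]
    pos += 2
    if first < 0x80:
        length = first  # short form: the byte is the length
    else:
        n = first & 0x7F  # long form: n length octets follow
        if n == 0:
            raise ValueError("indefinite length is not allowed in DER")
        if pos + n > len(buf):
            raise ValueError("truncated length")
        length = int.from_bytes(buf[pos:pos + n], "big")
        if buf[pos] == 0 or length < 0x80:
            raise ValueError("non-minimal length encoding")
        pos += n
    end = pos + length
    if end > len(buf):
        raise ValueError("declared length exceeds buffer")
    return tag, buf[pos:end], end
```

For example, `read_tlv(bytes.fromhex("0402686900"))` returns `(4, b"hi", 4)`, while `read_tlv(bytes.fromhex("04056869"))` raises `ValueError` because the header claims five bytes and only two are present. The historical bugs tend to come from skipping exactly the `end > len(buf)` style of check.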

Contrast this with text-based protocols that rely on scanning forward to find a delimiter - the lengths are implicit. I think this is what really can cause bugs, as not having explicit lengths makes it easier to forget to check them against the buffer's size.

On the other hand, there is some research[1][2] which indicates that one way of improving software security and reliability is to (1) realize that any program that accepts input is basically a recognizer of strings belonging to some formal language, and (2) that we should limit the grammar of said language to regular or, at most, context-free. Length fields automatically make the grammar context sensitive which is much harder to secure according to langsec.

[1] http://langsec.org/

[2] http://youtu.be/UzjfeFJJseU

Interesting link, just skimmed it and haven't had a chance to watch the video.

> Length fields automatically make the grammar context sensitive which is much harder to secure according to langsec.

Is this accurate given a finite length field? I can imagine a DFA that recognizes the language of a single byte length prefix followed by strings of 1 to 255 characters, just that the node that consumes the length field will have 255 branches to sub-DFAs that recognize 1, 2, ..., 255 character strings.
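A minimal Python sketch of that automaton, under the assumption that the language is exactly one length byte L in 1..255 followed by exactly L arbitrary bytes. Instead of drawing 255 separate sub-DFAs, the state is "start", "reject", or the number of bytes still expected, which gives the same finite state set (258 states) in compact form.

```python
def accepts(data: bytes) -> bool:
    """Run a DFA over `data`; the next state depends only on (state, byte)."""
    state = "start"  # other states: integers 0..255 (bytes remaining), "reject"
    for byte in data:
        if state == "start":
            # The length byte selects which sub-DFA to enter.
            state = byte if byte >= 1 else "reject"
        elif state == "reject" or state == 0:
            # Input after the declared end, or after an earlier failure.
            state = "reject"
        else:
            state -= 1  # consume one payload byte
    return state == 0
```

`accepts(b"\x02hi")` is true, while `accepts(b"\x02h")` and `accepts(b"\x02hii")` are false. The state count grows with the largest representable length, so a 32-bit length field would need on the order of 2^32 states.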

Yes, I should have said “unbounded length field”. But, with respect to the discussion, a 32 or 64-bit integer is only bounded in the academic sense.

Also, here is a video that works a bit better as an introduction to LANGSEC: https://www.youtube.com/watch?v=3kEfedtQVOY (around 19:00 is especially entertaining)

That's the tricky bit with the formal language hierarchy: it collapses when you add restrictions to finite quantities. For example, a context-free language with productions limited to a finite recursion depth is a regular language.
