RJSON: compress JSON to JSON (cliws.com)
62 points by dogada on June 5, 2012 | 41 comments



I see a pretty major problem with this. It seems to depend on the order of key/value pairs in object literals being well-defined. That order is not defined by the JSON or ECMAScript standards, so you can't really depend on the order of keys in a JSON object unless you explicitly define some order (alphabetical, for instance). I like the basic concept of compressing JSON to JSON, but this is not a particularly good way to do it, since the order of those keys may not be preserved in round trips through JSON encoders and decoders in various languages.
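To make that concrete, here is a minimal sketch (not RJSON's actual code) of a packer that trusts key order; any encoder along the way that reorders keys changes the packed shape:

    // Sketch: an order-dependent packer that trusts Object.keys() ordering.
    function naivePack(rows) {
        var header = Object.keys(rows[0]);   // column names, in whatever order they arrive
        var packed = [header];
        rows.forEach(function (row) {
            packed.push(header.map(function (key) { return row[key]; }));
        });
        return packed;
    }

    // Both literals describe the same JSON value, yet the packed output (and any
    // schema derived from it) differs, because it depends on key order alone.
    naivePack([{first: "Homer", last: "Simpson"}]); // [["first","last"],["Homer","Simpson"]]
    naivePack([{last: "Simpson", first: "Homer"}]); // [["last","first"],["Simpson","Homer"]]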


I saw the same thing right away. It seems like the first item in the array would have to be used as a hash of key/index pairs for the items that follow. I guess this would only be useful with large sets of data; the small examples make it look kind of silly =)

So this:

    "users": [
        {"first": "Homer", "last": "Simpson"},
        {"first": "Hank", "last": "Hill"},
        {"first": "Peter", "last": "Griffin"}
    ],
Becomes:

    "users": [
        ["first", "last"],
        ["Homer", "Simpson"],
        ["Hank", "Hill"],
        ["Peter", "Griffin"]
    ],


If you can infer a schema, I'd almost prefer a column-oriented arrangement:

    "users": {
        "first": ["Homer", "Hank", "Peter"],
        "last": ["Simpson", "Hill", "Griffin"]
    },
Then the information about the original structure can be restored from a set of object paths that need to be "rotated" from column to row orientation, e.g. ["users"] (sketched below).

This saves a few characters in the process, though I see that the advantage of this system is supposedly that it can handle any shape of data, not just data with a fixed schema. I've been trying to figure out how trang [http://www.thaiopensource.com/relaxng/trang.html] does its schema inference trick (it turns a set of XML files into a RELAX NG schema). If you have a schema for a JSON file, it's knowledge you can apply to algorithmically creating really efficient transformations.
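A minimal sketch of that column/row rotation, assuming a fixed schema and records that all share the same keys (the helper names are made up):

    // Rotate row-oriented records into column-oriented arrays, and back again.
    function toColumns(rows) {
        var columns = {};
        Object.keys(rows[0]).forEach(function (key) {
            columns[key] = rows.map(function (row) { return row[key]; });
        });
        return columns;
    }

    function toRows(columns) {
        var keys = Object.keys(columns);
        return columns[keys[0]].map(function (_, i) {
            var row = {};
            keys.forEach(function (key) { row[key] = columns[key][i]; });
            return row;
        });
    }

    toColumns([{first: "Homer", last: "Simpson"}, {first: "Hank", last: "Hill"}]);
    // => {"first": ["Homer", "Hank"], "last": ["Simpson", "Hill"]}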


Exactly, ZenPsycho, the purpose of RJSON is to convert data with a dynamic schema. Fields with default values are often omitted; for example, if most of the data has the 'private' property set to False, it makes sense to output it only for the 1% of objects with 'private' set to True. This issue is addressed in RJSON.


I prefer this approach. It's more readable and less fragile.


Doesn't this introduce ambiguity? How do you represent a list of 'tuple lists'?



You could steal the method RJSON uses and do this:

   "users": [
        [3, "first", "last"],
        [3, "Homer", "Simpson"],
        [3, "Hank", "Hill"],
        [3, "Peter", "Griffin"]
    ],


[deleted]


That is only enough to make it seem like it works okay. But as I said, there's nothing in the JSON spec that obligates any intervening party to preserve the order of the keys in an object.


It doesn't matter. Even if the order of keys is changed somewhere along the way during JSON.stringify, the schema id always uses the keys sorted alphabetically, and object values are always stored in the same order as the sorted keys: https://github.com/dogada/RJSON/blob/master/rjson.js#L213
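The idea, sketched (an illustration of the approach, not the actual rjson.js code):

    // The schema id is derived from the sorted key names, so the order in which
    // an encoder happens to emit the keys never matters.
    function schemaKey(obj) {
        return Object.keys(obj).sort().join(",");
    }

    // Values are emitted in the same sorted-key order.
    function packValues(obj) {
        return Object.keys(obj).sort().map(function (key) { return obj[key]; });
    }

    schemaKey({first: "Homer", last: "Simpson"}); // "first,last"
    schemaKey({last: "Simpson", first: "Homer"}); // "first,last" -- same schema either way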


Thanks ZenPsycho, I added sorting of object schema keys to fix this issue: https://github.com/dogada/RJSON/commit/a27c8927cd0c2d7d151e2...


I would like to emphasize that this is only really useful in environments where gzip is not available (as the OP notes). Some tests using the demo JSON (minified):

test.json = 285 bytes

test.rjson = 233 bytes (18%)

test.json.gz = 205 bytes (27%)

If you are able to bundle an RJSON parser, why not just bundle an existing, well-understood and well-tested compression scheme such as http://stuartk.com/jszip/ or https://github.com/olle/lz77-kit/blob/master/src/main/js/lz7... instead?


An arithmetic coding scheme whose model is based on the probabilities found in JSON abstract syntax trees would significantly improve on most commonly used generic compression schemes. Arithmetic coding schemes have largely been avoided thus far due to patents, which have recently expired, if I remember correctly.

Using the order-2 precise model on this page I get 190 bytes, and that is still a generic, non-JSON model: http://nerget.com/compression/


This. JSON-specific compression schemes aren't going to yield gains over AST-friendly schemes unless the JSON serialization specification changes significantly.

Along these lines, shipping a schema with the data payload is Avro-like, which is also questionable in terms of efficiency when compared with gzip/LZO.



They are using gzip compression level 1. Bogus.


Are you referring to the graph, in which they set the gzip compression as "1" in order to clearly show the ratio of compression improvement that their technique has over gzip?


And if you use gzip on a file, it has some overhead (the 10-byte gzip header) and a freshly initialized deflate state. Usually, compression improves as more data is seen, since the dynamic Huffman tree improves and there are more blocks for LZ77 to back-reference.
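The fixed overhead is easy to see with a quick Node sketch (byte counts are approximate and vary slightly with the zlib version):

    var zlib = require("zlib");

    // gzip of empty input is pure overhead: 10-byte header + 8-byte trailer + a couple of deflate bytes.
    zlib.gzipSync(Buffer.from("")).length;        // ~20 bytes
    zlib.gzipSync(Buffer.from('{"a":1}')).length; // ~27 bytes -- larger than the 7-byte input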


> reduce JSON data size and network traffic when gzip isn't available. For example, in-browser 3D-modeling tools like Mydeco 3D-planner may process and send to server megabytes of JSON-data;

`Content-encoding: gzip` anyone?


I don't believe that browsers let uploads be gzipped, as the browser can't be sure that the server supports gzipped requests.


Good call. I would think there should be a way to specify it... Something to look into I guess.

EDIT: or, as a comment above states, compress it yourself (gzip/deflate). Not the most elegant, but an option if space is an issue.


Good idea, but I think for those extreme cases where you really have a use for this you might as well go with a specialized protocol, which can be based on JSON if you need it.

I had to do this for an application which streamed several hundred data points per second to a browser client. Both data size on the network and parsing time in the browser were my biggest issues. Since it was an HTML app I had to use either JSON.parse or custom parsing written in JavaScript, the second option being too slow to be viable. I ended up with something based almost entirely on JSON arrays, where the client knows what the different items in each array mean. With his example it would look something like this: [7, ["programming", "javascript"], ["Peter", "Griffin", "Homer", "Simpson", "Hank", "Hill"]]

So in other words there's just enough overhead to make it parsable by JSON.parse, but otherwise you only transfer the actual data.
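A sketch of what the client side of such a hand-rolled protocol might look like (the message layout and type id here are made up for illustration):

    // The client knows each message layout up front, so the wire format carries
    // no key names at all -- just a type id followed by positional data.
    function decodeMessage(raw) {
        var msg = JSON.parse(raw); // plain JSON, so JSON.parse still does the heavy lifting
        if (msg[0] === 7) {        // hypothetical "user list" message type
            var tags = msg[1];
            var names = msg[2];
            var users = [];
            for (var i = 0; i < names.length; i += 2) {
                users.push({first: names[i], last: names[i + 1]});
            }
            return {tags: tags, users: users};
        }
        throw new Error("Unknown message type: " + msg[0]);
    }

    decodeMessage('[7,["programming","javascript"],["Peter","Griffin","Homer","Simpson","Hank","Hill"]]');
    // => {tags: ["programming", "javascript"], users: [{first: "Peter", last: "Griffin"}, ...]}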

Note that I wouldn't recommend this option unless you really hit a wall where you realize that this in fact is a bottleneck.


RJSON not only compresses data of any structure but keeps it in JSON format, which is a huge advantage over gzip/7z and other binary compression algorithms. You can compress 1MB of JSON to 300KB of gzipped JSON, but then you will have to parse 1MB of JSON on the client anyway. With RJSON you can compress 1MB of JSON to 500KB of RJSON, then gzip it down to 200KB and parse only 500KB of JSON. Then you can analyze some core properties of the parsed RJSON document and unpack it on demand. For example, you may send the client a collection of 20 documents, parse each one and show its title to the client, then fully unpack only the document the client selects.
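A rough sketch of that unpack-on-demand flow, assuming each document is packed on its own so it can be unpacked lazily, and assuming the pack/unpack entry points are named RJSON.pack / RJSON.unpack (documents, responseText, showTitle and render are placeholders):

    // Server side (sketch): pack each document separately, keep the title in the clear.
    var payload = documents.map(function (doc) {
        return {title: doc.title, packed: RJSON.pack(doc)};
    });

    // Client side (sketch): parse the whole (smaller) response once...
    var items = JSON.parse(responseText);
    items.forEach(function (item) { showTitle(item.title); }); // titles need no unpacking

    // ...and fully unpack a document only when the user selects it.
    function onSelect(index) {
        render(RJSON.unpack(items[index].packed));
    }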

I agree with Too that a specialized protocol will always win, but RJSON IMO decreases structural entropy almost to zero without the need to debug your own protocol.


If your HTTP connection is going to be gzip compressed, then manual compression of this kind is not guaranteed to reduce the size of the final result, and may in fact hurt it.


If your HTTP connection is going to be gzip compressed, then this will only give you monetary savings to compensate for the additional code complexity if you have medium to big data, at which point storing it in JSON is a questionable-at-best proposition.


How would it compress and then decompress the following document?

  {"users": [
    {"first": "Homer", "last": "Simpson"},
    {"first": "Hank", "last": "Hill"},
    [2, "Peter", "Griffin"]
  ]}


    {
        "users": [
            {
                "first": "Homer",
                "last": "Simpson"
            },
            [
                2,
                "Hank",
                "Hill"
            ],
            [
                0,
                2,
                "Peter",
                "Griffin"
            ]
        ]
    }
There was a tester linked from the site. Looks like he added the [0, to handle that case.


Yeah, schema index 0 is reserved for arrays whose first value is numeric. You can see more examples in the unit tests: https://github.com/dogada/RJSON/blob/master/test/tests.js


Unit test suggestion: testPackAndUnpack should also check that double-compress and double-decompress come out identical; that will tend to find any remaining issues of this kind, if there are any.
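For example, something like this (a sketch, not tied to the project's actual test harness; deepEqual stands in for whatever deep-equality assertion the suite already uses, and the pack/unpack names are assumed):

    // Packing twice and unpacking twice should still round-trip to the original data.
    function testDoublePackAndUnpack(data) {
        var doublePacked = RJSON.pack(RJSON.pack(data));
        var restored = RJSON.unpack(RJSON.unpack(doublePacked));
        deepEqual(restored, data, "double pack + double unpack restores the original");
    }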



There's a demo here: http://www.cliws.com/p/rjson/

It comes out to:

    {
      "users": [
        {
          "first": "Homer",
          "last": "Simpson"
        },
        [
          2,
          "Hank",
          "Hill"
        ],
        [
          0,
          2,
          "Peter",
          "Griffin"
        ]
      ]
    }


I wondered if it was April Fools' Day... Really, JSON is lightweight enough that it doesn't need compressing. We're not dealing with XML here! This just adds another point of failure.


If you just use GZIP, supported by most browsers, it compresses without the need for special software on the client side to rebuild the original JSON response. And it does a better job at compression, I bet. I thought of doing something like this to rebuild cycles in object graphs, but I didn't because it requires special parsing logic for the client to use.


Similar to JSONH, which I use to pass recordsets back and forth. There's no recursion for nested stuff, though.

https://github.com/WebReflection/JSONH


dogada, do you have any results showing both size reduction and speed of compression/decompression vs. other methods?


At the algorithmic level the compression is mostly identical to JSONH. I haven't done performance tests yet; I still have ideas for how to improve RJSON.


It's all fun until someone loses an eye.

What if I look at your API output and assume it's plain JSON (it happened to contain only unique items) but it's actually RJSON? Or whatever?

The most important thing when adding another layer to a protocol is identification.

So, please, put the whole altered object into an "rjson" root node so it's clear what we're dealing with.


Correct me if I'm wrong, but isn't this what Content-type is for?


I agree that identification is an important issue, but IMO it's a protocol-level issue. RJSON is not a protocol, it's an algorithm. Some will prefer to wrap everything in {"rjson": ...}, others will like {"format": "rjson", "data": ...}, and so on. I believe the algorithm itself should impose as few limitations as possible.


The MIME type is allowed to add information about the nature of the data encoded on top of the encoding. This is why we have application/xhtml+xml. It means "this is encoded using XML, and by the way the parsed XML structure is XHTML".

The ideal place to handle this is (as pointed out above) content negotiation, specifically as 'application/rjson+json'.
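For example (a sketch; 'application/rjson+json' is not a registered media type, just the suggestion above, and payload here is a placeholder):

    // Sketch: label the body as RJSON-packed JSON instead of plain application/json.
    var xhr = new XMLHttpRequest();
    xhr.open("POST", "/api/documents");
    xhr.setRequestHeader("Content-Type", "application/rjson+json");
    xhr.send(JSON.stringify(RJSON.pack(payload)));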


It's an algo. But the resulting data is a protocol.



