JSONCrush – Compress JSON into URI Friendly Strings (github.com/killedbyapixel)
121 points by KilledByAPixel on Nov 26, 2019 | 62 comments



This looks interesting, but I wish there was an explanation of how it works; I'm particularly curious about how it extracts repeated substrings. I tried reading the code, but it starts off with:

  let X, B, O, m, i, c, e, N, M, o, t, j, x, R;
and doesn't get any more readable from there. (And that's the unminified version!)

There aren't any tests, either, so if you're using this and find a bug I guess you just have to hope that nothing breaks when you change a line like

  for(M=N=e=c=0,i=Q.length;!c&&--i;)!~s.indexOf(Q[i])&&(c=Q[i]);


Sorry, that part of the code is from JSCrush, which I did not write. I don't fully understand it, but here's an explanation someone wrote...

https://nikhilism.com/post/2012/demystifying-jscrush

If you go to the live demo, it will crush/encodeURI and then uncrush/decodeURI the string you pass in and verify that the result matches the original. Maybe I should write some automated tests.


Did someone actually write that or did it come from a minified output?


I assume they minified it, but I don't have access to the unminified source. I'd like to clean it up or find a cleaner version, but for now it works well enough.


The article you linked above contains a partially unminified version.

EDIT: On second thought, you'd probably be better off using a completely different compression algorithm that doesn't sacrifice performance for golfability.


I do want to use that as reference if I ever decide to clean it up.


No way, that compression algorithm is amazing. But I would also like someone to explain really simply how it works.

I mean it's doing significantly better than LZ. Anyone want to provide simple intuition on these algorithms and where they come from?


The compression works by identifying repeating subsequences in the input data and picking the one that gives the best savings if replaced by a single byte not yet part of the input. This step requires time quadratic in the length of the input. After replacing the subsequence and tacking it onto the end so it can be recovered, the process continues until no repetition can be found or all possible bytes have been used.
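
Roughly, one pass of that loop might look like the sketch below (names are illustrative, not the actual JSCrush source):

  // One greedy pass: find the repeated substring whose replacement by a
  // single unused character saves the most bytes. This nested scan is
  // what makes the running time blow up for long inputs.
  function crushOnce(input, unusedChars) {
    let best = null;
    for (let len = 2; len <= input.length / 2; len++) {
      for (let start = 0; start + len <= input.length; start++) {
        const sub = input.slice(start, start + len);
        const count = input.split(sub).length - 1;
        if (count < 2) continue;
        // Each occurrence shrinks to 1 char; appending the substring (plus
        // a delimiter) at the end so it can be recovered costs len + 1.
        const saved = count * (len - 1) - (len + 1);
        if (!best || saved > best.saved) best = { sub, saved };
      }
    }
    if (!best || best.saved <= 0 || !unusedChars.length) return null;
    const token = unusedChars.pop();
    return input.split(best.sub).join(token) + token + best.sub;
  }

Repeat until no replacement saves bytes or the pool of unused characters runs out.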

It can beat LZ's compression ratio for two reasons:

1. The input consists only of bytes that are valid in a URI, so it doesn't need an additional encoding step.

2. It gets very slow very quickly for larger inputs.


Also, LZ beats it out for longer strings, but not by much for strings in the target range (~5000 characters).


I guess you/someone could try a re-implementation.


Rison (https://github.com/Nanonid/rison) is a nice alternative that seems to be more readable IMHO.


Yes, I use this one. It's not as compact, but it remains readable/editable; notably, Kibana uses it as well.


Thank you. I had been running into a JSON parse error in a script for a couple of days. I love JSON, but this compact Rison notation helped me quickly debug the error! :-)

Thanks again!


Looks cool, I just checked into it. It does work well, but it doesn't appear to do anything to reduce repeated substrings. Maybe a combination of the two would work well though.


I think a more 'popular' pattern is to base64-encode the JSON. This is how we do it (tracking pixels/GET requests), and lots of analytics/ad tech uses that pattern as well. But I'm not sure about the compression/size.


Encoding to base64 increases the size by at least 33%. This is meant to help make links small enough to share in places like Twitter, Discord, etc., where the max length is around 4000 bytes.


If you compress the JSON first, it isn't quite so bad.

Example #1 (short string):

  input: 103 bytes
  gzip(input): 87 bytes
  base64(gzip(input)): 117 bytes
Example #2 (long string):

  input: 3122 bytes
  gzip(input): 840 bytes
  base64(gzip(input)): 1121 bytes
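
For reference, numbers like these can be reproduced with a quick Node.js sketch along these lines (exact byte counts depend on the input and gzip settings):

  const zlib = require('zlib');

  function sizes(json) {
    const raw = Buffer.from(json, 'utf8');
    const gz = zlib.gzipSync(raw);
    return {
      input: raw.length,
      gzip: gz.length,
      base64OfGzip: gz.toString('base64').length,
    };
  }

  console.log(sizes(JSON.stringify({ some: 'example', data: [1, 2, 3] })));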


Thanks, that is a good reference. It makes sense that gzip would beat out JSONCrush for longer strings. Just as another point of comparison, how about encodeURIComponent(gzip(input))?


This project also blows up the size, seemingly by much more than 33% on average. You'd save some space by using base64.


Hold my beer!

...

https://donohoe.dev/project/jspng-encoder/

Life is too short to do something useful, so why not encode all your site's JavaScript code into a PNG image and then decode it on demand?

(This is a terrible idea. Don’t do it. Just for fun)


This is quite cool. How do you convert the file to a PNG?

Why is this a terrible idea?

I'm thinking I can Whatsapp friends a huge letter as an image and they could use my toy app to decode it ;-)


A PHP script does the conversion; it basically takes the code and converts it into a base64-encoded PNG.
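
For the curious, an illustrative in-browser equivalent might look something like this (just a sketch of the idea, not the project's actual PHP script):

  // Pack each byte of the source into one RGB channel of a canvas,
  // then export the canvas as a (lossless) PNG data URL.
  function textToPngDataUrl(text) {
    const bytes = new TextEncoder().encode(text);
    const pixels = Math.ceil(bytes.length / 3);
    const width = Math.ceil(Math.sqrt(pixels));
    const height = Math.ceil(pixels / width);
    const canvas = document.createElement('canvas');
    canvas.width = width;
    canvas.height = height;
    const ctx = canvas.getContext('2d');
    const img = ctx.createImageData(width, height);
    for (let i = 0; i < bytes.length; i++) {
      // 3 data bytes per pixel; the 4th (alpha) channel stays at 255 so
      // the browser doesn't premultiply and corrupt the RGB values.
      img.data[Math.floor(i / 3) * 4 + (i % 3)] = bytes[i];
    }
    for (let a = 3; a < img.data.length; a += 4) img.data[a] = 255;
    ctx.putImageData(img, 0, 0);
    return canvas.toDataURL('image/png');
  }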


Awesome project.


The code is a bit dense for my taste; these are extracts from the unminified version:

  for(M=N=e=c=0,i=Q.length;!c&&--i;)!~s.indexOf(Q[i])&&(c=Q[i]);

  RegExp(`${(g[2]?g[2]:'')+g[0]}|${(g[3]?g[3]:'')+g[1]}`,'g');


This is the JSCrush algorithm, which I did not write, and it is kind of magic. The decompressor is crazy small. You can read more about it here...

https://nikhilism.com/post/2012/demystifying-jscrush


The zzart format looks ridiculously wordy. A lot of compression could be achieved by shortening the property names. I wonder how much of a size advantage BSON would provide, but it would have to be encoded into something amenable to a URI.
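
For example, something as simple as a key map applied before stringifying goes a long way (keys here are made up, not the actual zzart schema):

  const SHORT = { backgroundColor: 'b', strokeWidth: 'w', points: 'p' };

  function renameKeys(value, map) {
    if (Array.isArray(value)) return value.map(v => renameKeys(v, map));
    if (value && typeof value === 'object') {
      return Object.fromEntries(Object.entries(value)
        .map(([k, v]) => [map[k] ?? k, renameKeys(v, map)]));
    }
    return value;
  }

  const state = { backgroundColor: '#222', strokeWidth: 2, points: [[0, 0], [5, 9]] };
  const compact = JSON.stringify(renameKeys(state, SHORT));
  // -> {"b":"#222","w":2,"p":[[0,0],[5,9]]}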


Here's another approach that compresses even better while retaining (limited) human readability: https://bitbucket.org/tkatchev/muson/

(Though it's mostly meant for making things smoother in statically-typed languages.)


Are they usable in a URI, though?


No, you'll need to percent-encode them first.


Maybe further savings could be had by converting to CBOR before URL-encoding.
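
Possibly, though the binary output still needs a URI-safe wrapper, so some of the saving comes back. A rough sketch, assuming the npm "cbor" package:

  const cbor = require('cbor'); // assumption: npm "cbor" package

  function toUriParam(obj) {
    // CBOR output is binary, so it still needs base64 (or similar) plus
    // percent-encoding before it can go into a URL.
    return encodeURIComponent(cbor.encode(obj).toString('base64'));
  }

  function fromUriParam(param) {
    return cbor.decodeFirstSync(Buffer.from(decodeURIComponent(param), 'base64'));
  }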


Idea: just put the thing in localStorage? It's cheaper, less fiddly, you can encrypt it with something like TEA if you so desire, and doesn't make urls unshareable by existing


> and doesn't make urls unshareable by existing

But when you put JSON in the URL, you normally do want exactly that: making the URL shareable.


Yes, localStorage is great, but this is for making URLs that are short enough to share on Twitter, Discord, etc. The max Twitter URL length is ~4000 characters, I think.


I was looking for something like this last week, so I was interested to try it. It just crashed in Firefox, with the fans kicking up to max. It worked in Chrome with a 31% reduction in size, but it took a while, and the fans kicked in again.


I had some issues with long strings, like 5000 characters or greater. What length string were you using?


22728 characters! Maybe that's way bigger than this is meant for.


Yes, I'm interested in trying to make it work with longer strings. Worst case, if it gets a long string it could just split it up into chunks of a length it can handle. The speed seems to decrease rapidly with string length at some point, eventually hitting a brick wall; the sweet spot seems to be around 1000-5000 characters.


That means the algorithm's complexity is non-linear. I currently suspect the JSCrush code, but it could be somewhere else as well.


It is the JSCrush code. I don't fully understand it, but it's a brute-force approach to finding the longest repeated substrings.


Why?


I have a personal project that saves state to the URL's hash (https://www.rebalancecalc.com).

I do that because I wanted users to be able to save their state without burdening those users with accounts or burdening myself with maintaining a DB for something so simple.

I also somewhat abuse the history API and use my "read the URL and load state" logic to implement undo-redo via navigating back and forth, though that doesn't seem to work right now. I am working on a refactor that uses redux to implement undo-redo and just replace state, to keep the user's history clean.

Storing encoded JSON in the URL hash is a nifty hack in my opinion. Users can save state in a bookmark or share it with others easily, and it's clear to the users that "where they are" in the URL bar maps to the current app state. Plus, bookmark syncing is taken care of by most browsers to make that state available elsewhere, etc. For the site owner, it means not needing a DB to make an app with some kind of state persistence.

One risk: be sure the state you persist to the URL is in a schema you plan to retain compatibility with! Blind serialization and de-serialization is a recipe for bugs and misery the next time you add a feature.
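
For anyone curious, the core of the pattern is only a few lines (function names here are illustrative, not the actual rebalancecalc source):

  function saveStateToHash(state) {
    // Setting location.hash also pushes a history entry, which is what
    // makes the back/forward undo-redo trick possible.
    location.hash = encodeURIComponent(JSON.stringify(state));
  }

  function loadStateFromHash(defaults) {
    try {
      const raw = decodeURIComponent(location.hash.slice(1));
      return raw ? { ...defaults, ...JSON.parse(raw) } : defaults;
    } catch (e) {
      return defaults; // tolerate old or hand-edited URLs
    }
  }

  window.addEventListener('hashchange', () => {
    render(loadStateFromHash({})); // render() is a placeholder
  });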


I do the same on a production app at work, but for a different reason.

Storing confidential data in the hash assures my users that I (the developer) don’t have access to this data, since anything after the hash never gets sent to the server by the browser.

It wouldn’t stop me from sending that information with a POST request afterwards, but the code is open source and it would be noticeable in developer tools.


Sorry, I don't understand: you put confidential data into a URL, so it's probably in server logs, etc.?


If a user types this into their web browser: https://www.example.com/webpage?q=54#{"userdata":"its a secret"}

Then the web browser sends a message to the server that looks something like this:

   GET /webpage?q=54 HTTP/1.1
   Host: www.example.com
   Cookie: well maybe there's a cookie here
Although it's encrypted with SSL, and there are some hopefully irrelevant messages along with it (which aren't actually irrelevant, since they can be used to fingerprint you).

As you can see, the bit that comes after the hash isn't ever sent from the client to the server. It was originally meant so that you could link to a particular section of a longer web page, so it was irrelevant to the server.

Nowadays it's exposed to JavaScript. This means that the code can rely on it: it can read and set it. The JavaScript author could read it and use it in an entirely in-browser JavaScript app. Or they could read it and send it to the server over a more secure channel, like the body of a POST request, to reduce the chances of it being stored in server logs.

But what comes after the hash is never processed by any standards-compliant web server nor transmitted by any standards-compliant web browser/client.
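
To make that last option concrete, a sketch (the endpoint name is made up):

  // The fragment never leaves the browser on its own; the page's script
  // has to read it and decide how (or whether) to transmit it. Sending it
  // in a POST body keeps it out of URLs and, usually, out of server logs.
  const secret = decodeURIComponent(location.hash.slice(1));

  fetch('/api/process', {           // hypothetical endpoint
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ data: secret }),
  });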


You missed the part where he says browsers do not send the bits after the hash character.


JSON is pretty ubiquitous for transferring data between web resources, so compressing the data with a novel method is always welcome.


Ingress and egress are expensive in the cloud. Using this with gzip could save you on cost.


One use case where I've put (URI-encoded) JSON in the query string is parameterizing GET requests with more structured data, since plain query-string encoding is fairly limited and GET requests don't have a body like POSTs do.
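
Something along these lines (the endpoint and parameter names are made up):

  // Serialize the structured filter, percent-encode it, and ship it as a
  // single query parameter.
  const filter = { status: ['open', 'pending'], assignee: { team: 'infra' } };
  const url = '/api/tickets?filter=' + encodeURIComponent(JSON.stringify(filter));

  // The server just reverses it, e.g. with Express: JSON.parse(req.query.filter)
  fetch(url).then(r => r.json());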


You can probably use extended URL encoding for that, which is supported by at least body-parser [https://npmjs.com/package/body-parser]. Not sure about other PLs.


Can't you just URL-safe base64 encode it and get the same result? Aren't URL strings limited in their length as well?


As far as I know, base64 is not guaranteed to be URI-safe, though most of the time you should be fine. More importantly, however, converting to base64 automatically increases the size by 33%.

https://developer.mozilla.org/en-US/docs/Web/API/WindowBase6...


The parent poster was referring to the "base64url" variant of Base64, which uses `-` and `_` as the two special characters and leaves out the useless padding, making it URL-safe. (The 33% expansion still applies, though.)

https://tools.ietf.org/html/rfc4648#page-7
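
A browser-side sketch of that variant (recent Node versions also accept 'base64url' as a Buffer encoding):

  function toBase64Url(str) {
    const bytes = new TextEncoder().encode(str);     // UTF-8 safe
    const b64 = btoa(String.fromCharCode(...bytes)); // fine for short strings
    return b64.replace(/\+/g, '-')                   // the two URL-safe
              .replace(/\//g, '_')                   // replacement characters
              .replace(/=+$/, '');                   // drop the padding
  }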


As others mentioned, size is the main concern here, which is also why JSONCrush could be useful. But I'd definitely use POSTs first if possible.


If you don't care about IE and Edge, you can fit 8 KB-64 KB in the URI; otherwise, 2 KB is pretty safe.


It may not apply in your case, but in some cases, you can give a GET request a body - Elasticsearch relies on this.
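
For example, with Node's http module (fetch refuses to attach a body to a GET, so a lower-level client is needed; the query here is just illustrative):

  const http = require('http');

  const body = JSON.stringify({ query: { match_all: {} } });
  const req = http.request({
    host: 'localhost',
    port: 9200,
    path: '/_search',
    method: 'GET', // a body on GET is unusual, but Elasticsearch accepts it
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body),
    },
  }, res => res.pipe(process.stdout));
  req.end(body);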


From what I've seen, many intermediaries do nasty things if you have a body on anything that doesn't 'normally' have one. I've heard a fair amount about GCP's load balancers dropping unexpected bodies, and other tools giving a 5XX.

It is really unfortunate, because there are tons of use cases and zero reason to interfere with these requests.

The justification is also downright ridiculous. The argument is that "GET, DELETE, ... have no defined semantics for bodies". Meanwhile the 'defined semantics' for POST bodies is... whatever the application decides.


I made this exact argument in session last week in the HTTP working group at the IETF while discussing the semantics of DELETE. The working group chair is a quite passionate defender of the position that GET should never have a body, and in the end I was outvoted by the room.

I'm slowly coming around to the idea that he might be right. The problem is the (semantic) question of what resource is being discussed. The semantics of GET, HEAD and OPTIONS are that (unless it gets modified) the same resource should always be answered the same way. If the resource that we're asking about is identified by the URL + the body, then those requests (and DELETE) all need a body too. And then there's an open question for PUT and POST about what resource exactly is being modified by the request - although as you say, the semantics are whatever so that's less important.

I think HTTP has a problem in that it's hard to express a resource name that is complex. Like, "the records which match this specific elasticsearch query" is a hard thing to pack into a URL. If HTTP were different, I could imagine a second body-like part of each request called the "resource section" with its own Resource-Type: application/json header and so on. But instead we just have a URL, so we end up doing awful hacks like packing JSON into URLs and making them unreadable. The URL is long enough for this sort of thing - implementations have to allow at least 8k of space for them, and can allow more space. But they're hard to read and display, and there's no standard way to pack JSON in there.

So I wonder if it'd be worth having a formal, consistent way to encode stuff like this in the URL. I'm spitballing, but maybe we need a standard way to encode JSON into the URL (base64 or whatever) defined in an RFC. Then if you do that, you can add a "Query-Type: application/json" header that declares that the URL contains JSON encoded in the standard way. Then we could at least have consistency and nice tooling around this. So like, when you're making a request you could just pass JSON into your http library and it'd encode it into the URL automatically in the standard way. And when the URL is being displayed in the dev tools or whatever, it could write out the actual embedded JSON / GraphQL / whatever object that you've packed in there in an easy-to-read way.
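
Purely as an illustration of that spitball (the Query-Type header is hypothetical, not a real standard):

  // Client side: one agreed-upon encoding of the JSON query into the URL,
  // plus a header telling the server (and tooling) how to decode it.
  function getWithJsonQuery(baseUrl, query) {
    const encoded = encodeURIComponent(JSON.stringify(query)); // or base64url
    return fetch(`${baseUrl}?q=${encoded}`, {
      headers: { 'Query-Type': 'application/json' }, // hypothetical header
    });
  }

  getWithJsonQuery('/records', { match: { status: 'active' } });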


I don't have too big of a problem with avoiding GET bodies. What I do have a problem with is that there's no way to POST an application/json body cross-origin without a preflight, which kills latency. So I'm forced to resort to text/plain or similar hacks.
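
For what it's worth, the hack looks like this (the URL is made up):

  // text/plain is a CORS-safelisted content type, so this cross-origin
  // POST goes out without a preflight; the server just treats the body
  // as JSON anyway.
  fetch('https://collector.example.com/events', {
    method: 'POST',
    headers: { 'Content-Type': 'text/plain' },
    body: JSON.stringify({ event: 'click', ts: Date.now() }),
  });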


The OPTIONS response should be cached by the browser if the headers are being set correctly in the response. If you want to avoid the latency cost you can proxy the POST request via your own backend, though you can't send authentication credentials from the user's browser to the 3rd party site that way.

Using text/plain only exists as a backwards compatibility thing. I wouldn't be surprised if that stops working at some point, since it breaks the security model.


Won't proxying the request just introduce its own latency? Either way you get two round trips. Is it measurably better?


I like this as a compression method, but I'm not sure how I feel about sticking it in a URL.


It worked great for my use case: http://zzart.3d2k.com

It reduces share URLs by about 75% for these very repetitious JSON strings.



