Use streaming JSON to reduce latency on mobile (instantdomainsearch.com)
189 points by beau on Feb 19, 2018 | 84 comments



I agree that a streaming response is cool, but why the dismissal of returning valid JSON, streamed? Why invent a new protocol when JSON already exists? Streaming JSON parsers aren't unicorns, they are horses (sorry).

With this newline-delimited JSON format, all your clients HAVE to know about your new protocol. They have to stream the response bytes, split on newlines, unescape newlines in the payload (how are we doing that, btw?), etc. If a client doesn't care about streaming, it can't just sit on the response and parse it when it's done coming in. Or what if later on you upgrade the system so that the response is instant and streaming is no longer necessary? Then you move on to a new API and have to keep supporting this old streaming-but-not-really endpoint forever.
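For concreteness, this is roughly what such a client ends up doing (a minimal sketch, assuming a browser fetch and a newline-delimited endpoint; not anyone's actual code): read the body as a stream, keep the trailing partial line buffered, and hand each complete line to JSON.parse.

    // Sketch only: consume newline-delimited JSON with fetch, buffering
    // partial lines across chunks.
    async function streamNdjson(url, onObject) {
      const response = await fetch(url);
      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
      for (;;) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop(); // keep the trailing partial line for the next chunk
        for (const line of lines) {
          if (line.trim()) onObject(JSON.parse(line)); // newlines in values arrive escaped as \n
        }
      }
      if (buffer.trim()) onObject(JSON.parse(buffer)); // last object, if not newline-terminated
    }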


I've been using jsonlines for logfiles since about 5 years before it acquired a name.

With line delimited logfile objects, it's easy to grep for a string of interest and then only parse the lines that match -- much more efficient than parsing an entire logfile to pick out 0.01% of the lines.
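A rough sketch of that pattern in Node (the filename and field names are made up): do a cheap string test per line, and only JSON.parse the lines that pass.

    // Sketch only: hypothetical logfile and fields.
    const fs = require('fs');
    const readline = require('readline');

    const rl = readline.createInterface({
      input: fs.createReadStream('app.log.jsonl'),
      crlfDelay: Infinity,
    });

    rl.on('line', (line) => {
      if (!line.includes('"level":"error"')) return; // grep-style filter first
      const entry = JSON.parse(line);                // parse only the matching lines
      console.log(entry.time, entry.msg);
    });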

That's not the use case talked about in this article, but it is a use case that's important to me.


Jsonlines is also an excellent intermediate format for working with huge datasets that are too big to be loaded into memory all at once, and where the data is hierarchical so CSV is not an option. Compared to XML you save a lot in total file size, and it's just as easy to parse.


It's also easy to load them into Google's BigQuery: https://cloud.google.com/bigquery/docs/loading-data#supporte...


I was just about to name that as an example. Newline-separated JSON is a much simpler solution to the problem than streaming JSON parsers. It's easier to understand, and it's trivial to implement.

From a purity point of view, sure, streaming JSON is more "correct", but I wouldn't call it the better solution.


http://jsonlines.org/

> Each Line is a Valid JSON Value

This seems like the better option I agree.


Google led me to this one today, lol: http://ndjson.org/

I'd make fun of NIH but I'm so guilty of it too.


I've just googled what jsonlines was, and it turns out I've been using it for a while in log files too, where I wanted JSON objects but needed the file to support random access (exactly like you described with grep). I hadn't even realised it was a formal thing. Thank you


> That's not the use case talked about in this article, but it is a use case that's important to me.

Yeah, and I agree with you 100%. But lots of things that are great for log storage aren't appropriate for an API.


Together with jq, jsonlines make for great logging.


It should be trivial to write a tool to stream a JSON file with proper newlines.


Not sure what you're getting at -- my tools that generate json logfiles do trivially add newlines where I want them.

jq's "compact" mode will reformat files that don't have newlines where I want them.


I mean that the tools to add the newlines to a compact JSON file with minimal whitespace are not going to be the bottleneck. I am fairly sure that grep uses a forward streaming access pattern and not random access, so inserting a relatively simple command into the pipeline to insert newlines, like jq, probably isn't going to slow it down.

The tool itself wouldn't be difficult to write, or like you said, just use jq
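Such a tool is only a few lines if the file fits in memory (a sketch with made-up filenames; for truly huge files you'd want jq or a streaming parser instead):

    // Sketch: read a JSON array and re-emit it as one compact object per line,
    // roughly what `jq -c '.[]'` does. Assumes the input fits in memory.
    const fs = require('fs');

    const items = JSON.parse(fs.readFileSync('input.json', 'utf8'));
    const out = fs.createWriteStream('output.jsonl');
    for (const item of items) {
      out.write(JSON.stringify(item) + '\n'); // JSON.stringify never emits raw newlines
    }
    out.end();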


Yeah, that’s really weird. They mention streaming parsers, then immediately say “ignore them.” Why? They apparently think this solution is better somehow, but it would be nice if they’d explain why.

Also, if you’re going to reinvent the wheel and make a custom framing format, why would you choose a delimiter that can legally appear in your content? Separating your JSON with newlines is complete madness. If you’re sending UTF-8 then you can trivially use a byte that cannot appear in the data, like 0xff, as the divider.


> it would be nice if they’d explain why.

Just to fill in the picture here: it's because the built-in parser has a very good chance of being faster than anything you could write[1]. In addition, it's code that you don't own and don't have to maintain.

[1]: with exception to https://news.ycombinator.com/item?id=16413917


You’d use an existing streaming parser, not write your own.


Support for unmarshaling JSON into objects in a streaming way is spotty and inconvenient. It should be easier than it is.


Newlines can't appear in JSON strings (they must be encoded as \n).


But newlines can appear in JSON anywhere whitespace is allowed otherwise.

Yes, you can say “we will avoid this by never generating JSON with newlines.” But why would you do that when you can easily pick a delimiter that’s guaranteed by the spec not to appear?


I’ve often used the jsonlines format for application logging, though I didn’t have a name for it until today.

I appreciate this format because, as a log file, I can use all of my standard tools to work with it (grep awk etc.). Since it’s a text file, I can use standard text editors to view and search the contents.

Since my application is the one that’s creating the log file, newlines will not appear or will be escaped.

I can often accomplish my goal without parsing the objects, but when I need to, I can break out powerful tools like `jq` or RecordStream to query the contents in a structured way. I can also treat the log file as a streaming or archived data source and process it using database technologies like Redshift or Elasticsearch.

I have worked with log file formats that are actual XML or JSON documents, and they are inconvenient in practice. I can’t think of another format that would be more convenient than this one for structured application logging.


Lots of other tools work on a line by line basis.


But newlines can't appear within values in JSON Lines, which is what this is really based on.


I made this choice because so many tools support newline-delimited text. A quick search showed that others (Twitter, eBay) made the same choice, so I went with the flow.

Thank you for NSBlog, it's great!


You’re welcome!

I think the tooling issue is for debugging and inspection, in which case you can just add a quick `tr` invocation to your pipeline to get it into a form they’ll understand.


>I agree that a streaming response is cool, but why the dismissal of returning valid JSON, streamed?

JSON hasn't been designed for chunked interpretation. How would the client know when to start interpreting the received message? How could you tell the difference between a valid chunk and a malformed response?

>With this new-line-delimited JSON format all your clients HAVE to know about your new protocol

Yes. It's probably not a good option for a public API. It should be a complement to a more general API. I would put this in the bucket of micro-optimization for very specific use cases. Regardless, if you publish your API and document it, other developers should be able to consume it just fine. It's not rocket science.



> carriage-return (\r\n) delimited JSON-encoded activities

That's not JSON


Why would you say that?

> Insignificant whitespace is allowed before or after any of the six structural characters.

Where whitespace may be space, carriage return, line feed, or horizontal tab. It's explicitly valid JSON.

https://tools.ietf.org/html/rfc7159#section-2


OK, but my point is that it's a protocol that is more specific than JSON, i.e. streaming JSON in full generality has problems. If you want to restrict yourself to a subset and create a protocol, that's great, but it's not really "streaming JSON".


> JSON hasn't been designed for chunked interpretation. How would the client know when to start interpreting the received message?

Neither has XML and yet we have plenty of working streaming XML parsers.

> How could you tell the difference between a valid chunk and a malformed response?

Same as usual, when it breaks.


Streaming JSON is troublesome because the base object is unbounded. It's true that there are workhorses that handle this, but when debugging responses, ad hoc tooling is very useful. If you can't examine a response without using a streaming parser, the cost of maintenance goes up significantly.

It's important to remember why we use JSON. It's not because it's well suited to transmission. It's because it's easy to reason about. If you want to move to a streaming format that is not easy to reason about, you may as well move to a binary format.


curl is a streaming parser. Most tools know what to do with a newline.


I haven't checked, but does Chrome's preview tab handle this non-standard format? That's the level where a great deal of JSON debugging occurs.


Last time I tried streaming parsing it was with Oboe.js, which wasn't bad, but I remember it having approximately a 5x speed penalty. So you get the first item in your dataset faster, but the whole thing loads slower, a tradeoff that's not a no-brainer. I wonder how getline plus many repeated calls to JSON.parse works in comparison to that; I suspect the native parser probably has some startup overhead but is still better than a JS parser.
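For reference, the two approaches look roughly like this (a sketch; the /items endpoint and response shape are assumptions, and Oboe.js is assumed to be loaded globally):

    // Streaming JS parser (Oboe.js): a callback fires per array element as bytes arrive.
    oboe('/items')
      .node('items.*', (item) => console.log('got item', item))
      .done(() => console.log('done'));

    // getline-style alternative: split on newlines (with the buffering shown
    // elsewhere in the thread) and hand each complete line to the native parser.
    function parseLine(line) {
      return JSON.parse(line); // one small document per call
    }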

The other thing was that to get the full benefits of something like this (not necessarily about streaming parsers vs not) you had to rearrange the way your /whole/ stack works, streaming all the way from the database through all the backend layers to the frontend. It's satisfying when it works, but definitely a non-trivial amount of change.


It is a pragmatic choice. Most languages can consume newline-delimited text natively. Do you really think getline() is going away?


His point isn't that consuming data line by line is hard.


JSON does not allow literal newlines inside strings, so line endings are already escaped. All you need to do is output ‘non-pretty’ JSON.
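A quick check of that (plain JSON.stringify, nothing hypothetical):

    const obj = { name: 'example', note: 'line one\nline two' };

    console.log(JSON.stringify(obj));
    // {"name":"example","note":"line one\nline two"}  <- one physical line; \n stays escaped

    console.log(JSON.stringify(obj, null, 2));
    // "pretty" output adds real newlines, which is exactly what a jsonlines producer must avoid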


Nice catch, thanks! I didn't realize that. Looks like no control characters are allowed in JSON string literals.


Let me preface by saying that I don't have an immediate solution for this case, the details seem to be application specific. However, I will note that JSON is brutally inefficient by almost every computational metric that exists and it therefore is quite expensive unless your problem domain is trivial. I used to be in the position of spending -- literally -- millions of dollars per year on JSON because that was how people wanted to format the data. You can solve many engineering efficiency problems for millions of dollars/euros.

Which is to say I don't have an answer for why streaming JSON isn't valid in this case, but I can also say that if it were up to me I would never use it in an application that mattered. It is much too expensive for many (most?) applications.


This makes more sense when working with files (rather than an API). You can use traditional Unix commands.

Another reason might be that when working with hand-written files, the delimiter adds a bit of redundancy. This makes errors like unbalanced brackets easier to diagnose.


I agree. Keeping this backward compatible would not only simplify client design but also accelerate the adoption. I am not really sure how this would be implemented, though.


It's impossible to stream a JSON Array


Embedding newlines..... hmmmmm \nI’m not sure how one would do that


Somewhat related is the JSON Lines "spec": http://jsonlines.org/


I have nothing but praise for this service. It's fast and efficient, and does the job without bloat.

For instance, their iOS app weighs 888.8 KB! When it's common for simple apps to be 50 MB monsters, it's very refreshing to use something that has been developed with proper care.


Thanks!


You say that using WebSockets was less reliable than streaming HTTPS; can you elaborate on why? In my experience WebSockets are perfect for the use case you described. Are there disadvantages?


I tried this a long time ago. At the time, certain queries could hang a server, and other clients connected to the zombie server stalled until something timed out. With HTTPS, a load balancer can direct new queries to servers that are responsive.


I was also curious about this (and thanks for the article, very informative!)

Are you saying that at one point the servers would crash relatively often, which would leave socket clients hanging unless some complicated client-side code was written, whereas without sockets a load balancer could automatically switch clients to functional servers, without extra coding, and mitigate the issue? Isn't the problem the crashy servers?


Sure, but why keep state when you don't need to? Now that HTTP/2 is ubiquitous, sending new GET requests is no more expensive than sending bytes through a WebSocket.


Websockets are also bidirectional.


As someone who actually regularly uses a slow mobile connection (8 kilobytes per second!) with somewhat high ping (~90ms to Google), please don't do this thinking you're making my life significantly better. It barely makes a difference in performance once loaded, and the initial load time is ridiculously worse. I'd much rather you make your page work without JavaScript, kept the design light (as this page has otherwise done), and make your CSS cacheable.

Right now, it takes over 5 seconds(!!) for this page to load because of all the freaking JavaScript it has to download! With JS off, the page loads almost immediately. With a keep-alive connection, subsequent loads over HTTPS are not particularly long, unlike what this article seems to think. (Hacker News is one of the FASTEST sites I can access, for example. Even on my crappy connection, pages load nearly instantly.)

Simply letting me type, press enter, and wait 0.1~0.3 seconds for a new page response would not be a significantly worse experience -- however, due to the way the site is written, search doesn't work AT ALL with JS disabled.

So, lots of engineering effort (compared to just serving up a new page) for little to no actual speed improvement, and a more brittle website that breaks completely on unusual configurations... Yeah. Please don't do this!


You should get the iOS app: https://itunes.apple.com/us/app/instant-domain-search/id1068...

It uses the streaming API, and will work well for you.


Thank you for the link, but I have Android, and I'm looking at the performance by tethering to my PC.


You wouldn’t wait 0.3s for the next page; that’d be 0.3s plus a few seconds waiting for all the queries to return before showing anything. Streaming lets you show results earlier.

Most likely the page loads instantly because the server is not doing any real work, which is offloaded to the client/js.


I'm looking at the behavior of the search tool on the page, which makes a request to https://instantdomainsearch.com/services/all/<query parameters> through AJAX and gets streamed lines of JSON back as a result. It's not doing DNS queries from JS or whatever.

Looking at the behavior with curl and wireshark, what I see is that a full, new connection to the service does spend most of its time in DNS lookup and HTTPS handshake. It takes about 0.1~0.3s for the actual data to transfer.

What the article is recommending is basically, don't make a request per JSON object. (It streamed back 69 objects for the query I tried.) Using one connection to transfer the information saves a lot of overhead -- and I don't have a problem with that part of the advice.

What I mean is, instead of using JS at all to do this (and consequently triggering a 5 second initial page load, etc.), have the server build the page the traditional way, and send that -- that's still one connection for that data transfer, and with a light page design and keep-alive connection, the page load time does not seem like it would be significantly different here (most of the time is going to be in that 0.1~0.3s for the query to execute regardless) -- but the initial page load time would be significantly faster on slow connections.

If your queries do actually take many seconds, sure, maybe there would be a benefit there, but I'm not seeing the value on a page like this, and I really don't want people to take away the idea here that they should redesign their sites to use AJAX to "reduce latency on mobile" by default as it won't help, and in fact, tends to make things worse.


These services seem to often get .com/.org/.net/etc. results back immediately (maybe cached, or the API is just faster) while the longer-tail stuff is slower. Maybe not cached, or it has to hit one of those weirder TLD servers to check.

This is a case where the streaming solution makes sense because you don't want to have to wait until `example.weirdtld` is resolved when you just want to check for `example.com`.

Now, you kinda admitted this in this post, which is quite a change from your original post; at that point it seems that you're saying "this isn't always good", which is kinda self-evident.


There were a couple hours between the two posts, and I was actually pissed off when I wrote the first one -- I'd had more time to think about it and cool down a bit by the time I got through with that second one.

As someone who is in the class of people actually affected by developers' choices on how to "reduce latency on mobile", I reacted really badly when presented with a page that had a massive increase in load time because of JS -- when JS was being presented as the solution to a problem on mobile connections. (Frankly, I hate the way JS is being used on the modern web as it tends to make my experience browsing the web absolute hell for a couple weeks each month after I hit my bandwidth cap. I browse with it off by default, and my experience is usually better on the majority of otherwise well-designed websites.)

After reflecting on it more, I acknowledged that the article does make a good point -- don't make lots of small connections to the server if you can avoid it since connection start up time can be large -- and people in the replies reminded me that the average performance I'm seeing isn't necessarily the worst case performance, and problems with the way this site was put together didn't necessarily mean the technique is bad.

Now that more time has passed, I have to wonder why didn't the author just use chunked transfer encoding of HTML instead of JSON for the results? You'd get basically the same effect without needing any JavaScript at all -- and as the author brought it up in the post, he/she is clearly aware of it. That's really the first question I should have asked...
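For what it's worth, the chunked-HTML version is only a few lines on the server (a sketch, not the site's actual code; lookupDomains is a hypothetical async iterator standing in for the real backend):

    const http = require('http');

    http.createServer(async (req, res) => {
      res.writeHead(200, { 'Content-Type': 'text/html; charset=utf-8' });
      res.write('<!doctype html><ul>');                    // flushed as a chunk right away
      for await (const result of lookupDomains(req.url)) { // hypothetical slow backend
        res.write('<li>' + result.name + ': ' + (result.available ? 'available' : 'taken') + '</li>');
      }
      res.end('</ul>'); // Node uses chunked transfer encoding automatically when no Content-Length is set
    }).listen(8080);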


I share your sentiment. But isn't it just their experiment page that uses too much JS, as opposed to the approach itself requiring a ton of JS by nature?


I haven't tried to unpack their JS -- but you're right that it might not need to be as heavy as it actually is here.


Really interesting stuff! What about simply opening a WebSocket connection and using that for all requests if connection latency is such an issue?


I did this a few years ago before I knew it was a "thing" and felt really proud that it actually worked.

The use case was we had a slow database query for basically map pins. The first pins come back in milliseconds, but the last ones would take seconds. The UI was vastly improved by streaming the data instead of waiting for it all to finish, and the server code was easy to implement.

A different delimiter would have worked, but newlines are easy to see in a debugger.
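The server side of that is pleasantly small (a sketch; queryPins is a hypothetical async iterator over the slow database cursor):

    const http = require('http');

    http.createServer(async (req, res) => {
      res.writeHead(200, { 'Content-Type': 'application/x-ndjson' });
      for await (const pin of queryPins()) {     // hypothetical: yields rows as the DB returns them
        res.write(JSON.stringify(pin) + '\n');   // each pin reaches the client right away
      }
      res.end();
    }).listen(8080);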


I'd like to see this streaming JSON parser incorporated into GraphQL clients: http://oboejs.com


Any benefit to using this over Server-Sent Events? (other than IE/Edge support)


I think WebSockets are a much better fit for this if you don't want the reconnection overhead. Also, since WebSockets are bidirectional, you can keep the connection open and send all requests through it as well as receive responses from it. You can also send binary over WebSockets if you want to save bandwidth. We do this at work and it works pretty nicely.
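The browser side of that looks roughly like this (a sketch; the wss URL and message shapes are assumptions):

    const ws = new WebSocket('wss://example.com/search');

    ws.onopen = () => ws.send(JSON.stringify({ query: 'example' })); // requests share the same connection

    ws.onmessage = (event) => {
      const result = JSON.parse(event.data); // one result per frame, no delimiter needed
      console.log(result);
    };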


How is this handled from a UI perspective? As more applications are built around the idea of streaming data, I've found that UI elements tend to jump around, and I find myself clicking/tapping the wrong item more and more, because the item which had been under my thumb/cursor has jumped away just before I could activate it.


This is a good point. Debounce, static results ordering, and static elements with placeholders help. Instant Domain Search has room to improve here.


Isn't this pretty similar to how you would use WebSocket frames to transfer individual JSON elements when a client is subscribed?

At one job I had several years ago we came up with the same idea and used \n-separated JSON elements as a streaming response. We also tossed around the idea of using WebSockets to stream large responses between services.


Let's talk about reliability: the network is unreliable; firewalls might be broken, packets are dropped, IP addresses may change, cellphones lose connection in subway tunnels. Simply calling streaming "reliable" without even defining what the reliability is protecting against makes "reliable" an overstatement.

IMHO the most reliable way to get data from point A to point B is likely to have a client actively polling for data, using a strict socket timeout. Data should be delivered at least once. If streamed JSON is to be called anything remotely as "reliable" as periodic polling, it should at least have a strict timeout (not mentioned in the article) for receiving the next newline, and it should handle replaying of non-acked messages. Otherwise I would call it far from "reliable".


Each request takes a second or two to complete. If you lose the connection, the client can send the same request again (with exponential back off).
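Something along these lines (a sketch of the retry-with-backoff idea, not the actual client code):

    async function fetchWithRetry(url, maxAttempts = 5) {
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          const response = await fetch(url);
          if (response.ok) return response;
        } catch (err) {
          // network error or dropped connection: fall through to the backoff
        }
        const delayMs = 2 ** attempt * 500; // 500ms, 1s, 2s, 4s, ...
        await new Promise((resolve) => setTimeout(resolve, delayMs));
      }
      throw new Error('request failed after ' + maxAttempts + ' attempts');
    }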


Does anyone know a JSON parser that parses an ArrayBuffer instead of strings? [1]

JSON.parse() only accepts strings.

The library that the article recommends also uses XMLHttpRequest with strings. [2]

The reason I'm asking is the maximum string length in 32-bit Chrome.

[1]: https://developer.mozilla.org/en-US/docs/Web/API/XMLHttpRequ...

[2]: https://github.com/eBay/jsonpipe/blob/master/lib/net/xhr.js


If you want to parse more than 4GB of JSON in a browser, ArrayBuffer versus string isn't the most important criterion; you shouldn't parse a 4GB ArrayBuffer as a single chunk either. You can always convert between an ArrayBuffer and a string, but you'll want to compare parsers according to their ability to stream, as well as performance and memory efficiency.
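The conversion itself is one call (a sketch; for multi-gigabyte bodies you'd still decode and parse per line or per chunk rather than building one giant string):

    function parseJsonBuffer(buffer) {
      const text = new TextDecoder('utf-8').decode(buffer); // accepts ArrayBuffer or Uint8Array
      return JSON.parse(text);                              // still subject to string-length limits
    }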


You could look into streaming parsers such as Oboe.js, which specifically support the use case of parsing JSON trees larger than the available RAM[0]. Then again, when you're loading such huge JSON files into a 32-bit instance of Chrome, it is likely you should look for a totally different solution to your problem.

[0]: http://oboejs.com/examples#loading-json-trees-larger-than-th...


If the concern is HTTPS overhead, why not use HTTP/2 and send multiple requests?

I think streaming would be useful only if the responses are stateful and it's hard to share that state across requests.


Even with HTTP/2, sending a GET request is not free. Most of the benefit is that we can show the user results as they come in. Each query gets over 50 responses in random order from DNS, in-memory indexes of zone files, slower fuzzy searches over other data, and so on. Why wait to show them the .com result while .ca resolves?


Feels to me like the response content type shouldn't be "application/json" anymore (it is what's returned on that first example).


IIRC, earlier versions of WebKit wouldn't emit data from each chunk when I sent a custom content type. Or maybe there was some conflict with gzip, I forget. Now that those browsers are gone, maybe ndjson? jsons?


Is there an advantage to using chunked streams like this over using WebSockets like this: https://github.com/wybiral/hookah (this will take newlines from a program's stdout and send them over WebSockets, even aggregating multiple streams into one)?


I feel like I'm missing something here. Isn't that the point of using a JSON SAX parser instead of a DOM parser?


How is this different than SSE https://html.spec.whatwg.org/multipage/comms.html#server-sen... ? Or, why would someone choose this over SSE?


Why not just gzip the JSON? Should make complicated JSON around an order of magnitude smaller, and be more portable to boot.


I do gzip the streamed JSON. Try:

$ curl -H "Accept-Encoding: gzip" --trace - "https://instantdomainsearch.com/services/vanity/apple?hash=8...


Why would you use this over web sockets?


Should have (2017) in the title.



