
Hm very interesting. But to play devil's advocate, what's wrong with the Java approach or the v8 approach? Is it just that they are suboptimal in terms of efficiency, or is there something that SDE can accomplish that they can't?

The Java approach is to do parsing, semantic analysis, and compilation to bytecode on one side of the network. Then JIT compilation is done on the other side.

The v8 / "modern JS" approach is to do gzip compression of source code on one side of the network. Then decompress, lazily parse, and lazily generate native code on the other side.

Now I recently looked at v8's parser, and it's fiendishly complex [1], somewhere north of 20K lines of code -- and yes, that's just the parser.

So there are downsides. SDE might be more elegant and efficient overall(?), but when you are dealing with networked code, you can't change both sides of the wire at once. So we are sort of stuck with suboptimal interfaces and brute force on both sides.

That said, I do think there is a good argument to be made for shipping plain text source code and using brute force to make parsing and analysis in the browser fast. (That is, writing ridiculous amounts of hand-tuned C++ code.)

[1] https://www.youtube.com/watch?v=Fg7niTmNNLg&t=2s




There's nothing "wrong" with the Java or V8 approach per se. It's not even a given that SDE would do better today. Note that most of Franz' own work has since been on the JVM.

I just think SDE is worth more exploration. It lost out not because it failed to compete on merits with the JVM but because the JVM basically cut short any hope of it getting mindshare at the time.

It's possible the JVM or V8 approaches are simply superior designs. But we don't really know, because such huge resources have gone into making the JVM and V8 approaches work, while not just SDE but most other approaches from that time became near-instant dead ends.

> The v8 / "modern JS" approach is to do gzip compression of source code on one side of the network. Then decompression, lazily parse, and lazily generate native code on the other side.

The interesting part of this is that you could treat SDE simply as a way to short-circuit the parsing and compression steps, by using it to serialize the AST while retaining exactly the same code generation. You can still lazily generate native code from it, as long as you apply the heuristics to build the dictionary at load time (necessary to rebuild the correct tree).

Whether it'd be worth it is an open question, but I wish I had time to explore it.


I'm actually exploring something "spiritually" related: I currently have a project where I store the state of an app in the URL. I compress trees of JavaScript objects by first putting them through a pre-compression "filter" step to make the data more compressible (not going into that now, but see [0]), then applying JSON.stringify, and finally running the resulting string through lz-string to create a URI-safe string[1]. lz-string is basically a wonderful, wonderful hack that Lempel-Ziv compresses strings to a bitstream, which is then turned back into a string.

I'm basically looking at ways to cut out the JSON.stringify step. So basically: first turn a JSON-friendly object into a binary representation, which is then compressed into a bytestream and stored in a URI-safe string.
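A toy sketch of what cutting out JSON.stringify could look like (this is not lz-string's actual format, and the type tags and helpers are invented here): encode a JSON-friendly value straight to bytes, then pack those bytes into a URI-safe alphabet.

```javascript
// Toy tagged binary encoding: 0=null, 1=number, 2=string, 3=array, 4=object.
// Assumes short strings/arrays (< 256 entries) to keep the sketch small.
function encode(value, out = []) {
  if (value === null) out.push(0);
  else if (typeof value === "number") {
    out.push(1);
    const buf = new DataView(new ArrayBuffer(8));
    buf.setFloat64(0, value);
    for (let i = 0; i < 8; i++) out.push(buf.getUint8(i));
  } else if (typeof value === "string") {
    const bytes = new TextEncoder().encode(value);
    out.push(2, bytes.length);
    out.push(...bytes);
  } else if (Array.isArray(value)) {
    out.push(3, value.length);
    for (const v of value) encode(v, out);
  } else {
    const keys = Object.keys(value);
    out.push(4, keys.length);
    for (const k of keys) { encode(k, out); encode(value[k], out); }
  }
  return out;
}

// 6 bits per character from a URI-safe alphabet, the same trick lz-string's
// URI-safe variant uses.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
function toUriSafe(bytes) {
  let bits = 0, acc = 0, result = "";
  for (const b of bytes) {
    acc = (acc << 8) | b; bits += 8;
    while (bits >= 6) { bits -= 6; result += ALPHABET[(acc >> bits) & 63]; }
  }
  if (bits > 0) result += ALPHABET[(acc << (6 - bits)) & 63];
  return result;
}

const packed = toUriSafe(encode({ zoom: 3, genes: ["Actb", "Gapdh"] }));
```

A real version would still want an actual compression pass (the LZ part) between `encode` and `toUriSafe`; this only shows the binary-representation step replacing JSON.stringify.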

This might lead to something generally usable, and faster/smaller than lz-string. I know JSON alternatives like flatpack and messagepack exist, but this would be a much simpler, lighter JS library, single-digit kilobytes in size before gzipping. It also would probably be backwards compatible all the way to IE6.

[0] https://github.com/linnarsson-lab/loom-viewer/blob/master/cl...

[1] http://pieroxy.net/blog/pages/lz-string/index.html


That's interesting. Yes, I'd encourage looking at how SDE functions there: SDE is ultimately a Lempel-Ziv-Welch variant that operates directly on a tree instead of a serialized string, and whose heuristics for what to add to the dictionary can include templated tree subsets.

That part of it is really independent of the code generation aspect, and could easily be adapted to serialize arbitrary types of tree structures.

Basically what you need is a general mechanism that takes a possibly templated rule and adds it to the dictionary. Then you need a mechanism that looks at a JS object and, instead of spitting out JSON, matches the current node to one of the dictionary items, spits out a bit pattern, and then recursively spits out the bit patterns for the template arguments. The decoder can be a very simple recursive descent: identify the dictionary entry, look it up, find the template rule, recursively call itself to retrieve the template arguments, and reconstitute the object.
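The encode/decode halves of that can be sketched in a few lines. This is a hedged illustration with a static, hand-written dictionary and invented node shapes (emitting array entries rather than real bit patterns); the point is the shape of the encoder and the recursive-descent decoder, not SDE's actual wire format:

```javascript
// Dictionary of templated node shapes; each entry names the fields ("holes")
// that get filled in recursively. Leaves carry an inline payload instead.
const DICT = [
  { type: "num", holes: [] },              // leaf: followed by its value
  { type: "add", holes: ["left", "right"] },
  { type: "mul", holes: ["left", "right"] },
];

// Encoder: emit the matching dictionary index, then recurse into each hole.
function encodeTree(node, out = []) {
  const idx = DICT.findIndex(e => e.type === node.type);
  out.push(idx);
  if (DICT[idx].holes.length === 0) out.push(node.value);
  for (const h of DICT[idx].holes) encodeTree(node[h], out);
  return out;
}

// Decoder: plain recursive descent driven by the same dictionary.
function decodeTree(stream, pos = { i: 0 }) {
  const entry = DICT[stream[pos.i++]];
  const node = { type: entry.type };
  if (entry.holes.length === 0) node.value = stream[pos.i++];
  for (const h of entry.holes) node[h] = decodeTree(stream, pos);
  return node;
}

const tree = {
  type: "add",
  left: { type: "num", value: 2 },
  right: { type: "mul",
           left: { type: "num", value: 3 },
           right: { type: "num", value: 4 } },
};
const roundTripped = decodeTree(encodeTree(tree));
```

Swapping the static DICT for one built by heuristics at load time, and the array entries for variable-length bit codes, gets you toward what's described above.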

The "magic" then lies in the heuristics, which can be very generic (e.g. suitable for any JSON) or very specific (e.g. embedding domain knowledge of what the trees are likely to be shaped like, or at least code to be able to recognize and encode complex shapes when seen).


OK, fair enough!

FWIW here is a long thread I started about an AST serialization method, with some prototype code: https://www.reddit.com/r/ProgrammingLanguages/comments/79fkp...

It is lightly "compressed" -- there are no native pointers, but offsets instead. The in-memory format is the same as the on-disk format. You can mmap() a file and then start following "pointers" right away.
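The offsets-instead-of-pointers idea can be illustrated without mmap() by reading a binary tree straight out of a flat buffer. This is a sketch with an invented layout (each node as three 32-bit ints: value, left offset, right offset, 0 meaning no child), not the format from the linked thread:

```javascript
// Write one node (12 bytes) at the given byte offset.
function writeNode(view, at, value, left, right) {
  view.setInt32(at, value);
  view.setInt32(at + 4, left);
  view.setInt32(at + 8, right);
}

const buf = new ArrayBuffer(36);           // room for 3 nodes
const view = new DataView(buf);
writeNode(view, 0, 10, 12, 24);            // root at offset 0, children at 12, 24
writeNode(view, 12, 20, 0, 0);
writeNode(view, 24, 30, 0, 0);

// A reader chases "pointers" (offsets) directly -- no deserialize pass.
// In C you'd mmap() the file and do the same arithmetic on the raw bytes.
const root = 0;
const leftChild = view.getInt32(root + 4);
const rightChild = view.getInt32(root + 8);
const sum = view.getInt32(root)
          + view.getInt32(leftChild)
          + view.getInt32(rightChild);
```

Because nothing in the buffer is machine-specific (no absolute addresses), the same bytes work as both the in-memory and on-disk representation.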

I didn't develop it for transmission over a network, but it could be used in place of Python's .pyc files, which essentially cache parsing and bytecode compilation, and are a big tree/graph data structure. It can represent arbitrary graphs.

You could add dictionary compression in a tree-aware way or a tree-oblivious way... not sure how that compares to SDE.

FWIW someone suggested WebAssembly on that thread, but it's apparently no longer an AST -- it's bytecode (I guess more similar to Java.)


That's interesting. I think where SDE goes further is that you can go full LZW with it. The "dictionary" here is not a static dict, but a dynamically built dictionary that is expanded as you parse the stream, as with LZW compression.

That makes it harder to traverse - you can't jump to arbitrary locations without decoding everything before them, because you need to look at everything to apply the heuristics that build the dictionary. But it is likely to compress far better - e.g. you can have dictionary entries shorter than a byte, and dictionary entries that represent whole subtrees, and in SDE's case those subtrees can be templated.
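A rough sketch of the dynamic-dictionary idea (invented encoding, and using whole-subtree matches as the only heuristic): every subtree seen is added to a growing dictionary, a repeated subtree is emitted as a single back-reference, and the decoder rebuilds the same dictionary as it goes, so no dictionary is ever transmitted -- the LZW property:

```javascript
// Encoder: post-order; a subtree already in the dictionary becomes a "ref".
function encodeLZW(node, dict = new Map(), out = []) {
  const key = JSON.stringify(node);        // structural identity, sketch only
  if (dict.has(key)) { out.push(["ref", dict.get(key)]); return out; }
  out.push(["lit", node.type, node.value ?? null, (node.children || []).length]);
  for (const c of node.children || []) encodeLZW(c, dict, out);
  dict.set(key, dict.size);                // dictionary grows during encoding
  return out;
}

// Decoder: recursive descent that grows its dictionary in the same order,
// so ref indices line up without transmitting the dictionary itself.
function decodeLZW(stream, dict = [], pos = { i: 0 }) {
  const item = stream[pos.i++];
  if (item[0] === "ref") return dict[item[1]];
  const [, type, value, n] = item;
  const node = { type, children: [] };
  if (value !== null) node.value = value;
  for (let k = 0; k < n; k++) node.children.push(decodeLZW(stream, dict, pos));
  dict.push(node);
  return node;
}

const tree = {
  type: "add",
  children: [{ type: "num", value: 7 }, { type: "num", value: 7 }],
};
const decoded = decodeLZW(encodeLZW(tree));
```

The second `{ type: "num", value: 7 }` comes out as a one-entry back-reference instead of a full literal; real SDE would go further by also matching templated subtrees, not just exact repeats.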


Hm, but if compression ratio is the only difference, then I don't see that as a bottleneck in most language systems?

I think that overall startup time for example is a more interesting goal, not just the amount of data transferred over the network or streamed off disk.

Startup time can be saved by doing parsing and semantic analysis on one side of the network. And if you know your target architecture, you can do code generation there too, although that doesn't seem to be the way any systems have evolved -- even Android was doing native code generation on the device recently.

I guess I am having trouble seeing situations where both of these are true:

1) A compression algorithm specific to a file format is say 50% or 100% better than gzip or xz.

2) That 50% or 100% results in a big user-perceived improvement.

Most bytecode formats are "lazily" decoded, although they can be somewhat large. I sort of like the scheme I proposed because it's both compact and can be lazily decoded. You could make it more compact, but then you would have to decompress the whole thing up front.


It's not just compression ratio, but that it's exploiting the nature of the compression mechanism so that the decompression step effectively translates to a depth first traversal of the tree, during which it generates the code.

> I think that overall startup time for example is a more interesting goal, not just the amount of data transferred over the network or streamed off disk.

I agree, but to that end size is a major factor, and if the tree is structured properly nothing stops you from starting to execute the code during generation ("just" generate stubs for missing functions that wait for, or trigger, generation of the missing function).
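The stub trick is easy to show in miniature. In this hedged sketch (all names invented, and "code generation" is just building a closure), each function starts as a stub that compiles itself on first call and then replaces itself, so execution can begin before the whole module has been generated:

```javascript
// sources maps a function name to a "compiler" that produces the real code.
function makeLazyModule(sources) {
  const module = {};
  let compiles = 0;
  for (const [name, compile] of Object.entries(sources)) {
    module[name] = (...args) => {
      module[name] = compile();            // generate real code on first call
      compiles++;
      return module[name](...args);        // then run it; stub is now gone
    };
  }
  return { module, stats: () => compiles };
}

const { module: m, stats } = makeLazyModule({
  square: () => (x) => x * x,
  cube:   () => (x) => x * x * x,
});
const nine = m.square(3);                  // triggers generation of square only
```

After that call, only `square` has been generated; `cube` stays a stub until someone calls it, which is the "start executing during generation" property.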

> Startup time can be saved by doing parsing and semantic analysis on one side of the network. And if you know your target architecture, you can do code generation there too, although that doesn't seem to be the way any systems have evolved -- even Android was doing native code generation on the device recently.

Agreed, and SDE does the parsing and semantic analysis before transmission too.

> I guess I am having trouble seeing situations where both of these are true:

But that is not the goal. The goal is to generate architecture-independent code in a format that is very fast to generate code from. The compression and reduced load time are just a happy side-effect of speeding up the code generation by structuring it for reusability of code fragments during generation.

As such the decompression is "free", as it's nothing more than very lightweight dictionary lookups and insertion during the code generation.

> Most bytecode formats are "lazily" decoded, although they can be somewhat large. I sort of like the scheme I proposed because it's both compact and can be lazily decoded. You could make it more compact, but then you would have to decompress the whole thing up front.

Lazy decoding is possible as described above, though I agree that the ability to do "random access" when generating the code is a benefit. Systems like SDE certainly do depend on outputting the code in a sensible order. Though incidentally you can get a lot of that simply by carefully deciding how to traverse the AST when generating the output. E.g. traverse outer definitions, then the main entrypoint, and output any obvious dependencies first.


Isn't WebAssembly arguably a kind of "slim binary"? It's basically a binary form AST, no?


I was actually thinking the same. WebAssembly certainly brings us a step closer to the SDE approach, though I haven't looked at it enough to be able to say how close.



