
On “On Asm.js” - bpierre
http://calculist.org/blog/2013/11/27/on-on-asm-js/
======
pornel
In this exchange UTF-8 got dragged into the list of ugly hacks, but it is a
_beautiful_ hack.

Endian-independent, more efficient than UTF-16 for most languages (often
_including_ CJK web pages: the halved cost of HTML & URLs makes up for the 33%
extra text cost), supports easy and safe substring search, can detect cut
characters, and all that with ASCII and C-string backwards compatibility.
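The "can detect cut characters" property comes from UTF-8's byte layout: lead
bytes and continuation bytes are distinguishable on sight, so a truncated
multi-byte character is self-evident. A minimal Python sketch (the helper
names are mine, just for illustration):

```python
# Continuation bytes in UTF-8 always match the bit pattern 10xxxxxx,
# so any byte can be classified without context.

def is_continuation(byte: int) -> bool:
    return byte & 0xC0 == 0x80

def ends_cleanly(data: bytes) -> bool:
    """Return True if `data` does not end in the middle of a character."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
cut = "hé".encode("utf-8")[:-1]  # chops the 2-byte 'é' in half
print(ends_cleanly(text))        # True
print(ends_cleanly(cut))         # False: lead byte with no continuation
print(is_continuation(text[2]))  # True: 0xA9 is the second byte of 'é'
```

The same property gives UTF-8 its self-synchronization: after a corrupted or
cut byte, a decoder can skip forward to the next non-continuation byte and
resume.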

If I could redesign the entire computing platform from scratch, UTF-8 is one
thing I wouldn't change.

~~~
pilif
The disadvantage of UTF-8 is that it's a variable-length encoding. This means
that certain operations which are usually O(1) become O(n) with UTF-8: one is
finding the n-th character in a string, another is finding the length in
characters (though that's also true for null-terminated C strings).

Another problem is that replacing a character in a string might change the
string's byte length, which might force a reallocation (also slow); and if
you're doing it in a loop over the string, your loop-ending condition might
now be wrong because the byte length has changed.

In fact, iterating over a UTF-8 string's characters is no longer something you
can do with a simple for loop; it requires at least one, possibly two function
calls per character (one to find the next character, one to find the end of
the string, which might just have moved due to your modification).
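To illustrate the O(n) indexing point, here's a rough Python sketch of
stepping through UTF-8 by lead byte (the function names are made up for the
example):

```python
# Each character's byte width is only known from its lead byte, so
# finding the n-th codepoint means walking from the start of the buffer.

def char_width(lead: int) -> int:
    if lead < 0x80:
        return 1                     # 0xxxxxxx: ASCII
    if lead >> 5 == 0b110:
        return 2                     # 110xxxxx: 2-byte sequence
    if lead >> 4 == 0b1110:
        return 3                     # 1110xxxx: 3-byte sequence
    return 4                         # 11110xxx: 4-byte sequence

def nth_codepoint(data: bytes, n: int) -> str:
    i = 0
    for _ in range(n):               # O(n) walk, not O(1) indexing
        i += char_width(data[i])
    w = char_width(data[i])
    return data[i:i + w].decode("utf-8")

s = "aé中😀".encode("utf-8")         # widths 1, 2, 3, 4
print(nth_codepoint(s, 2))           # 中
```

In a fixed-width encoding the same lookup would be a single multiply-and-index.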

Finally, efficiency: for English text, UTF-8 is the most efficient Unicode
encoding, but for other languages that isn't true. A Chinese text would
require three (occasionally four) bytes per character, as opposed to just two
in UCS-2 (which is what most OSes and languages use, even though it can't
encode all of Unicode).
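The size trade-off is easy to check in Python: pure CJK text is indeed smaller
in UTF-16, but ASCII-heavy markup around it can tip the total back toward
UTF-8 (toy strings, obviously):

```python
# Compare encoded sizes of pure CJK text vs. CJK embedded in ASCII markup.
# (utf-16-le is used to avoid counting a BOM.)

han = "中文文本"                      # pure CJK: 3 bytes/char in UTF-8
markup = '<a href="/中文">中文</a>'   # CJK inside ASCII-heavy markup

print(len(han.encode("utf-8")), len(han.encode("utf-16-le")))        # 12 8
print(len(markup.encode("utf-8")), len(markup.encode("utf-16-le")))  # 28 40
```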

For these reasons, dealing with a fixed-length encoding is much more
convenient (and speedier) while the string is loaded into memory. UTF8 is
great for i/o and storage on disk, but in memory, it's inconvenient.

UCS-2 or UTF-16 is the reverse: it's very inconvenient on disk and for I/O
(need I say more than BOM?), but in-memory UCS-2 is very convenient, even
though it doesn't support all of Unicode. It's in fact so convenient that it's
used by most programming environments (see yesterday's discussion about
strings being broken).

Take Python 3.3 and later for example: even though they now have full support
for Unicode, including characters outside the BMP that require more than two
bytes of storage, they didn't go with a variable-length in-memory encoding;
instead they chose the narrowest fixed-length encoding that can represent a
particular string.

This seems like an awful lot of work to me, but they decided the fixed width
was still worth it.
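That scheme is PEP 393, the "flexible string representation": 1 byte per
character for Latin-1 text, 2 for BMP, 4 for astral. The effect is easy to
observe; exact byte counts are CPython implementation details, so I only
compare the ordering:

```python
# Three strings of equal character count, but different widest codepoints,
# get different per-character storage widths under PEP 393 (CPython 3.3+).

import sys

ascii_s = "aaaa"
bmp_s   = "中中中中"      # BMP characters: 2 bytes/char internally
astral  = "😀😀😀😀"      # astral characters: 4 bytes/char internally

assert len(ascii_s) == len(bmp_s) == len(astral) == 4
print(sys.getsizeof(ascii_s) < sys.getsizeof(bmp_s) < sys.getsizeof(astral))
# True
```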

~~~
PeterisP
Umm, is there any Unicode encoding where finding the n-th character (not
codepoint) in a string is O(1)? In any encoding you can have a single
'composite character' that consists of dozens of bytes but needs to be
counted as a single character for the purposes of string length, the n-th
symbol, and cutting substrings.

This is not a disadvantage of UTF-8 but of Unicode (or natural-language
complexity) as such.
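Right - a single user-perceived character can span several codepoints in any
encoding. The classic example is a combining accent, easy to show in Python:

```python
# 'é' written as base letter + COMBINING ACUTE ACCENT is two codepoints,
# but renders (and should count) as one character.

import unicodedata

decomposed = "e\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))          # 2 codepoints
print(len(composed))            # 1 codepoint, same rendered character
print(composed == "\u00e9")     # True: NFC folds it to precomposed 'é'
```

So even UTF-32's O(1) codepoint indexing doesn't give you O(1) *character*
indexing.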

~~~
rolux
"UTF-32 (or UCS-4) is a protocol to encode Unicode characters that uses
exactly 32 bits per Unicode code point. All other Unicode transformation
formats use variable-length encodings. The UTF-32 form of a character is a
direct representation of its codepoint."
([http://en.wikipedia.org/wiki/UTF-32](http://en.wikipedia.org/wiki/UTF-32))

Of course, the problem of combining marks and CJK ideographs remains.

~~~
PeterisP
That's the point - you get O(1) functions that work on codepoints. Since for
pretty much all practical purposes you don't want to work on codepoints but on
characters, codepoint-level efficiency is pretty much irrelevant.

I'm actually hard-pressed to find any example where I'd want to use a function
that works on codepoints. Text editor internals and direct implementation of
keyboard input? For what I'd say is 99% of use cases, if codepoint-level
functions are used then that's simply a bug that hasn't been discovered yet
(the code would break on valid text that contains composite characters, say, a
foreign surname).

If a programmer doesn't want to go into detail of encodings, then I'd much
prefer for the default option for string functions to be 'safe but less
efficient' instead of 'faster but gives wrong results on some valid data'.

~~~
mercurial
For a lot of use cases, you're just dealing with ASCII though (hello, HTML).
Wouldn't it be possible, in a string implementation, to have a flag indicating
that the string is pure ASCII (set by the language internals), thereby
indicating that fast O(1) operations are safe to use?

~~~
PeterisP
What you describe is done with UTF-8 plus such a flag - if the string is pure
ASCII (codes under 128) then the UTF-8 representation is identical. IIRC the
latest Python does exactly that, passing UTF-8 straight to C functions that
expect ASCII if the strings are 'clean'.
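A toy sketch of that flag idea in Python (the class and method names are
hypothetical, purely for illustration; CPython tracks similar state internally
as part of PEP 393):

```python
# Compute an "is pure ASCII" flag once at construction; when it holds,
# bytes and characters coincide, so indexing is O(1).

class FlaggedString:
    def __init__(self, data: bytes):
        self.data = data                          # UTF-8 bytes
        self.is_ascii = all(b < 0x80 for b in data)

    def char_at(self, n: int) -> str:
        if self.is_ascii:
            return chr(self.data[n])              # fast path: O(1)
        return self.data.decode("utf-8")[n]       # slow path: full decode

s = FlaggedString("hello".encode("utf-8"))
print(s.is_ascii, s.char_at(1))   # True e
t = FlaggedString("héllo".encode("utf-8"))
print(t.is_ascii, t.char_at(1))   # False é
```

A real implementation would of course keep the slow path incremental rather
than decoding the whole string per lookup.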

But for what common use cases are you just dealing with ASCII? Unless your
data comes from a COBOL mainframe, you're going to get non-ASCII input at
random places.

HTML is a prime example of that - the default encoding is UTF-8, HTML pages
very often include unescaped non-ASCII content, and even if you're US-English
only, your page content can include things such as accented proper names or
the various non-ASCII quotation marks - such as the '»' used on the NYTimes
front page.

~~~
mercurial
It really depends on what kind of thing you are doing. Say you're processing
financial data from a big CSV. Sure, you may run into non-ASCII characters on
some lines. So what? As long as you're streaming the data line by line, it's
still a big win. You could say the same for HTML - you're going to pay the
Unicode price on accented content, but not with all your DOM manipulations,
which only involve element names (though I don't know how much of a hit
something like == takes when dealing with non-ASCII), or when dealing with
text nodes that don't have special characters.

I'm happy not to pay a performance price for things I don't use :)

~~~
PeterisP
The scenarios you mention would actually have significantly higher performance
in UTF-8 (identical to ASCII) than in fixed-width multibyte encodings such as
UCS-2 or UTF-32 that were recommended above. That's why UTF-8 is the
recommended encoding for HTML content.

Streaming, 'dealing with text nodes' while ignoring their meaning, and
equality checks are byte operations whose cost mostly depends on the size of
the text after encoding.

~~~
mercurial
I think we're actually in agreement :)

------
fdej
Naive question: is there a reason why we cannot just compile programs to both
native/bytecode and javascript and have browsers automatically fetch the
version they support? If it turns out that native/bytecode universally runs
(say) 20% faster and loads in half the time, the javascript target will
eventually die a natural death of obsolescence without compatibility ever
being sacrificed. If it turns out that the javascript target can keep up with
the performance of native/bytecode, then we can just stop compiling to anything
but javascript, and nothing will be lost. The attitude that we should not even
try to do better than compiling to javascript just seems odd.

~~~
azakai
Well, that seems to be what Google is doing with PNaCl and pepper.js.
Pepper.js uses emscripten to compile PNaCl apps so they run in JS
(specifically asm.js). In Chrome the PNaCl version can run, and everywhere
else it runs in JS. That sounds like what you are proposing?

You can try that approach out right now,

[http://www.flohofwoe.net/demos.html](http://www.flohofwoe.net/demos.html)

[http://trypepperjs.appspot.com/examples.html](http://trypepperjs.appspot.com/examples.html)

Those two sites have the same codebases built for both PNaCl and asm.js.
Overall they run pretty well in both, so this doesn't seem to show a clear
advantage to either JS or a non-JS bytecode. PNaCl starts more slowly, but
then runs more quickly, but even those differences are not that big. And
surely PNaCl will get faster to start up and JS will get faster to run,
because there is no reason either cannot get pretty much to native speed in
both startup and execution.

Note though that there are risks to this approach. No one enforces that
everyone create dual builds of this nature, and there is no guarantee that the
dual builds will be equivalent. So while this is interesting to do, it does
open up a whole set of compatibility risks, which could fragment the web.

~~~
cdash
Except that you cannot compile threaded programs to JavaScript, I'm pretty
sure, while you can to PNaCl - and that's my problem with all this talk of
asm.js: until it supports threads, I don't consider it an acceptable solution.

~~~
CmonDev
And people keep mentioning those 'Web Workers' as if Emscripten were capable
of generating them from LLVM thread abstractions.

~~~
flohofwoe
Nope, it's a better idea to create a higher-level thread-pool-based parallel-
task system which abstracts away the differences between pthreads and
WebWorkers. I think that most game engines have such a system in place anyway,
so they can be adapted relatively easily. YMMV because there may be more
overhead getting data in and out of WebWorkers, since they don't have a shared
address space (so in that regard they are more like processes).

------
jheriko
"On a shared medium like the web, where content has to run across all OSes,
platforms, and browsers, backwards-compatible strategies are far more likely
to succeed than discrete jumps."

This is a valid point, but hits on something that constantly grates with me.
We already have something massively more cross platform and performance
focused than JavaScript - C. The problem is that the standard library for C is
utterly lacking, and there is a lack of focus on developing mechanisms to
deploy native code /directly/ across the web.

Don't kid yourself either - all the data being thrown around at the moment is
always (always, always, always, otherwise it wouldn't even work!) translated
into native code in some way - we are already throwing native code around the
web, just in a horribly inefficient way. In today's world of sandboxing and
virtualisation, all of the security arguments that used to be completely
reasonable are quite thoroughly invalidated.

I really do believe that fixing this from the technical perspective is not an
impossible or even hard problem to solve - I can not stress this point enough.
Game developers are constantly re-solving this problem in limited,
optimisation focused ways. JavaScript implementations themselves are
ultimately built on top of this technology or other technologies built on top
of it. The standard libraries of modern scripting languages are possible to
use from within the constraints of C. We have already solved all of these
problems and there are myriad examples.

On the other hand, the political problem could be intractable... and that's
sad.

Worst of all perhaps, C can be made better for performance without much
thought or effort, and none of the new languages I see actually tackle this
problem, which is quite real and measurable and impacts everything from
productivity to the environment - they (quite rightly in many ways) focus
utterly on ease of use and massive standard libraries. Why can't we focus this
effort on fixing C, or providing a better alternative?

Why are we trying to catch up with native code performance instead of just
using native code? Why aren't we improving native code performance in a
serious way?

~~~
derefr
Native code is, at the bottom of its ladder of abstraction, always held at the
behest of some vendor's platform or another. Microsoft, Apple, and Google will
never agree on a set of standard C libraries that could be used to build rich,
modern graphical applications. If they could, we'd never have needed the web
for anything other than the ideas of hyperlinks and intents--everything else
could be done with URL-specified zero-install applications. (This was the
future portended by HyperCard.)

But things didn't go that way. Instead, each vendor built its own walled
garden, with its own platform-specific libraries to do the same things.

The web, in this reality, is simply our attempt at encapsulating away the
entirety of the OS, along with all its platform-specific libraries, as a sort
of really-big-BIOS, and then building a vendor-neutral platform on top, with
APIs that will work on _every_ computer, running _every_ OS.

But the fun thing is, once we get this web platform nailed down, and it can do
everything? Once the OS beneath it is redundant in every respect? We can
"molt" the outer layer away.

The future of Operating Systems will be the lineage of today's ChromeOS and
FirefoxOS, not Windows or OSX.

~~~
jheriko
"Microsoft, Apple, and Google will never agree on a set of standard C
libraries that could be used to build rich, modern graphical applications."

this is probably very true, but it isn't even a real problem. the standard
library can be expanded to encompass rendering, audio, networking and co.
regardless of platform nitty-gritties; we can implement a layer over the top
(lots of people do this to make games already).

when i look at the web as an approach to being platform independent i see
something which is truly quite poorly constructed. the kinds of bugs and
problems in the web stack are utterly alien in my world... libraries to work
around browser bugs? laying out objects on a screen 'complicated'? it's just
shoddy all over... especially the browser implementations and the myriad
frameworks piled on top of javascript and CSS.

coupled with the staggering loss of performance and increase in complexity I
have no desire to use web tech to develop my cross platform products - and my
native development moves at an extremely rapid pace.

i qualify this very heavily with "i know how to do this and have done it alone
and in teams many times more often than once"

a great example of this is everything already in the C standard library which
- whilst it seems simple today, now it's done - hides a great deal of OS and
platform specific details which vendors still disagree on but programmers in
that environment are largely (and rightly) unaware of. a quick look into the
Win32 APIs, X11, BSD socket implementations or the Cocoa/Objective-C stack on
OSX will show just how different the platforms can be in their 'low level'
interfaces for many things.

~~~
pcwalton
> laying out objects on a screen 'complicated'?

As jwz said in 1998 [1]:

"Convenient though it would be if it were true, Mozilla is not big because
it's full of useless crap. Mozilla is big because your needs are big. Your
needs are big because the Internet is big."

I think any document and application layout system that handles all the use
cases of CSS is going to be about as complex as CSS. Certainly PDF and
Microsoft Word .DOC are up there in terms of complexity.

[1]: [http://www.jwz.org/doc/easter-eggs.html](http://www.jwz.org/doc/easter-eggs.html)

------
avmich
I see the point, but I don't think it has enough justification. Something
evolutionarily successful isn't always globally perfect - rather, it's just
the most probable step along the path from there to here.

JavaScript may stay with us for a long time. Or - as sometimes happens - it
could be eclipsed in a few years, as has happened with technologies and
paradigms before. Say, sitting in 1990, guessing correctly which technologies
would be in use in 2000 would be quite hard.

Not that the "more direct approach" of defining and standardizing a bytecode
would always be better. With ideas disruptive enough, many bets are off.

~~~
derefr
But it seems to me that "evolving into" the use of other languages _through_
Javascript, will allow us to start ignoring the fact that it's Javascript in
particular that we're targeting--such that, one day soon, the browser-makers
will just give us a "shortcut" to doing the things that have already evolved,
without all the Javascript-y mess in-between.

But something has to _get_ popular _first_ as a new "open web-scripting
language", before the browser-makers will be willing to all go in on
supporting it. (Otherwise you get the reactions you see from, e.g., Mozilla to
Google's NaCl.)

And that was a chicken-and-egg problem until now, because you can't really
create _and force universal adoption of_ an "open web-scripting language" (or
framework, or platform, or bytecode, etc.) if you're just one company. But
now, with asm.js, you can--and the rest of the steps will follow soon after.

------
sirsar
> _On his impossibly beautiful blog_

I found the parallax background incredibly distracting. Maybe I just lack
focusing skills, but I would much rather read a plain text file than have a
slightly laggy background change every time I scroll.

~~~
SkyMarshal
I'm more than happy to make the effort to read a blog that pushes the limits
with JS/CSS/HTML the way acko's blog does, even if it marginally decreases
readability. I spend more time reading the source than the articles anyway.
Think of acko.net as a tech demo first, blog second.

------
dgregd
> I see plenty of reason to keep betting on evolution

So why did Mozilla help to kill WebSQL? Because NoSQL (IndexedDB) is so much
better than SQLite?

Mozilla is too focused on Brendan Eich's baby. They should really move
forward, because JavaScript is becoming the new IE6.

~~~
azakai
WebSQL was a completely separate debate. If I recall correctly, the issues
there had to do with WebSQL being heavily dependent on SQLite, a single
implementation, so it was hard to spec in a vendor-independent way as web
standards require.

IndexedDB is much less capable than SQLite, no doubt about it, but it's still
very useful, and far, far simpler and more feasible to spec and standardize
(which it has been).

~~~
iso-8859-1
But SQL is already standardized! They just have to determine a reasonable
subset of the existing standard.

~~~
azakai
SQLite is not identical to any of the SQL standards. For example, it is
dynamically typed

[http://www.sqlite.org/datatypes.html](http://www.sqlite.org/datatypes.html)

which is different from most RDBMSes.

------
skwosh
Onanism?

------
oddshocks
Damn that site is gorgeous yo

