Hacker News
Perspective: Streaming pivot visualization via WebAssembly (github.com)
377 points by texodus on Feb 6, 2018 | 130 comments

I wrote the C++ part of this codebase, would be happy to answer any questions about it here.

Did you experiment with doing it JS-only prior to the C++ version? If so, what kind of performance increase did you see with WebAssembly?


WebAssembly can certainly be faster than JS in certain situations, but well-structured, monomorphic JS with the kind of type hints for numeric values that asm.js is based on can be extremely fast as well.

It'd be great to check the assumption that WASM is actually faster in this case, and by how much, especially given the friction of sending anything other than numbers across the JS/WASM boundary.

And to be fair, you'd have to build it with something like prepack to get some of the optimizations LLVM gives you.

No, when I first prototyped it I wrote a Python version. From memory the first C++ build was ~80x faster than the Python prototype.

80x faster than using Pandas? What if you just used the Numpy values array?

You will get me into trouble! I said 80x faster than the Python prototype. I have benchmarked this against Pandas over the years and, depending on the use case, the two codebases trade blows with each other for static data. For streaming datasets, however, Perspective has a large advantage over Pandas, since Pandas does a full group-by on every update while Perspective does its work incrementally using deltas.
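To make the contrast above concrete, here is a minimal Python sketch of delta-based aggregation next to a full recompute. It is not Perspective's actual implementation: the `IncrementalSum` class, its method names, and the sample keys are hypothetical, and only a keyed sum is shown.

```python
from collections import defaultdict


class IncrementalSum:
    """Keeps a per-group sum up to date by applying deltas (hypothetical sketch)."""

    def __init__(self):
        self.rows = {}                      # primary key -> (group, value)
        self.totals = defaultdict(float)    # group -> running sum

    def update(self, key, group, value):
        """Insert or overwrite one row and adjust only the affected groups."""
        old = self.rows.get(key)
        if old is not None:
            old_group, old_value = old
            self.totals[old_group] -= old_value   # retract the old contribution
        self.totals[group] += value               # apply the new contribution
        self.rows[key] = (group, value)


def full_group_by(rows):
    """Baseline: recompute every group from scratch on each call."""
    totals = defaultdict(float)
    for group, value in rows.values():
        totals[group] += value
    return dict(totals)


agg = IncrementalSum()
agg.update(1, "AAPL", 10.0)
agg.update(2, "MSFT", 5.0)
agg.update(1, "AAPL", 12.0)   # a tick that overwrites row 1
```

Each `update` touches at most two groups regardless of table size, whereas `full_group_by` walks every row. A real engine would also need to handle deletes, empty groups, and aggregates that cannot be retracted by subtraction (such as max).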

I'm curious if you did any profiling? If the intent was always to rewrite the whole prototype in C++ then it kinda wouldn't matter, eh?

We do some benchmarking just for catching performance regressions and validating assumptions against the non-WebAssembly version, but definitely need more in this regard. [benchmark for wasm build](https://jpmorganchase.github.io/perspective/examples/benchma...)

I'd also be interested in some performance comparisons with a library like CrossFilter [1]. Does the improvement outweigh the penalties of crossing the JS/WebASM boundary?

[1] http://square.github.io/crossfilter/

The boundary-crossing is definitely the bottleneck right now. We are currently putting a lot of work into the Apache Arrow support specifically to avoid this crossover, which will allow us to send data from the server in binary and avoid parsing in the browser.

Can you elaborate on how that could work? Does Arrow really allow for abstracting away the need for serialization even in JS - server scenarios? I thought it was more of a shared-memory data frame utility?

There is an early arrow example in the examples package - superstore-arrow.html. The idea is that, instead of converting your data to internet->text->JSON->ArrayBuffer, you just keep the data in binary and write it directly into the C++ heap ArrayBuffer as-is. We currently do not read this in C++ directly for various reasons related to how emscripten allocates memory, but the general idea is the same.
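The difference between the text path and the binary path can be sketched with the Python standard library alone. This is only an illustration under stated assumptions: the payload here is a raw native-endian float64 column, not the Arrow format, and the `heap` bytearray merely stands in for the C++ heap ArrayBuffer mentioned above.

```python
import json
from array import array

values = [1.5, 2.25, 3.0, 4.75]

# Text path: the server serializes to JSON and the client must parse every number.
json_payload = json.dumps(values).encode("utf-8")
parsed = json.loads(json_payload.decode("utf-8"))

# Binary path: the server sends raw native-endian float64 bytes (an assumed layout,
# not the Arrow format) and the client copies them into a preallocated buffer as-is.
binary_payload = array("d", values).tobytes()
heap = bytearray(len(binary_payload))   # stands in for the C++ heap ArrayBuffer
heap[:] = binary_payload                # one bulk copy, no per-value parsing

# A typed view reads the numbers in place, similar to a Float64Array over an ArrayBuffer.
column = memoryview(heap).cast("d")
```

The binary path does a single bulk copy and then reads values through a typed view, so no per-value text parsing happens on the receiving side.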

bringing Apache Arrow in the browser alongside wasm is exciting to say the least! Amazing capabilities are coming to browsers...

We (Graphistry) recently contributed a native JS reader/writer into the Apache Arrow project, so may help both teams! We did it as legwork for our beyond-native efforts (GPU cloud streaming) and taming our JS datatypes, so similar needs here I'm guessing!

Funny enough: was in NYC today talking with banking teams about related tech. Too bad we didn't know about this effort, would have loved to meet!

Yes, this is the library we use.

We have met before actually, you did a demo at JPM in midtown several years back. Graphistry has come a very long way since then - impressive work!

Ah, small world. That was probably when we first started on client<>server GPU streaming. Looking forward to digging into the Perspective source!

Drop me a note the next time you are in NY, would love to meet up.

In addition to what texodus has said below, Crossfilter only implements a small subset of what perspective provides. For example, no streaming (although there is a related project from the Heer lab that supports incremental updates - https://github.com/jheer/datavore), only a single level of grouping, only in 1 dimension, and you can only support 16 dimension fields at once (without increasing a constant in the codebase).

I see it's possible to embed it in jupyterlab, it seems to do a lot:

* grid format
* graph format
* actual pivoting of data

Is that right?

It's using webassembly so in effect to do the pivot you must have created functions that exist in pandas?

Has anyone ported pandas to webassembly?

Is the grid editable? What are you using to create your output, html or canvas or something?

I realise I could look this stuff up but since you asked..

I believe you'd need to port Python to WebAssembly to make much use of pandas - unaware of any projects attempting the former.

The grid is just a plugin for the excellent [Hypergrid](https://github.com/openfin/fin-hypergrid), which is editable, but you'd need to manually push those edits back to the engine for now.

WebAssembly code can talk to the rest of your JS (and therefore asm.js) code, right?

So PyPy.js runs Python in the browser via asm.js:



Livebook uses PyPy.js to make a Jupyter-like notebook that runs entirely in the browser, and includes pandas:



And random plug for Observable in case you haven't seen it, a Jupyter-like Javascript environment, completely in the browser:


As historical background: Python was one of the early emscripten asm.js experiments: https://github.com/kripken/emscripten/tree/master/tests/pyth...

Nuitka can compile Python to LLVM IR, which can be compiled into WASM, I think.

which answers my other question, so it's canvas?

The frontend is actually pure JS/HTML - the WebAssembly part is just the data engine, and runs in a WebWorker for CPU isolation.

Hypergrid is a canvas-based renderer, yes.

Briefly in the animation I see there's hierarchical (or multi-dimensional) data used. Could you tell me about what sort of data (statistics, monitoring) this supports?

Also, what methods are available for getting data streamed [into Perspective]? I'm skimming through the docs, but am not quite getting it...

[edited for clarity]

Right now, you can call a load() or update() method on an engine or widget to pass JSON, CSV or Apache Arrow data (as an ArrayBuffer) to the engine.

Apologies for the lack of clarity in the documentation - this is our first release and very much under active construction!

Ok, thank you for this. I'll have to look into that Arrow support some time. JSON and CSV are already plenty.

I didn't have time for a proper look yet, so no worries. If at all possible post the code and data (or something like it) that is in those animations you have. That would perhaps allow users to dive in by tinkering here and there.

Just stalked you on linkedin with a similar question, but I was wondering if you were involved with the actual process of getting the library open sourced, and how that works at a large security focused company like JPMorgan Chase. (former Chase employee)

Thanks for taking questions. For a back end developer that's watching the front end landscape change from afar..

Do you feel like web assembly could potentially replace JS in the future with, say, my favorite back end language of choice?

Not the op, but there's more to frontend development than JS. You need to account for the API of browsers, the DOM, CSS and the myriad of frameworks and libraries built for the specific UI requirements of frontend development (e.g. React). So even if your favorite language of choice becomes compilable to WebAssembly, you'll still be learning an entirely different way of development that will be completely unfamiliar regardless.

The only thing I see languages compiled to WebAssembly taking over is development of highly resource intensive applications, algorithms or libraries. Or if there's a quantum leap with some specific language + framework that eliminates a lot of work with current frontend development. Otherwise, there will be too much fragmentation and you'll see the regression to the mean effect as has been the case with predecessors such as CoffeeScript, Elm, PureScript, ReasonML, and every other compile-to-JS language du jour.

A paradigm shift could happen and what's cool is that it could be entirely something new. This reminds me of the early days of the WWW when the first programming paradigm was Perl/CGI, then Cold Fusion, then Legacy ASP, and then ActiveX, and so on.

The possibility of some company creating, from scratch, a new web development paradigm, given everything we know today, is exciting.

Sure, and based on historical precedent, that's almost guaranteed to happen. But I doubt it will be whatever-my-favorite-backend-language-of-choice-is and more likely some-domain-specific-language-and-new-framework. In other words, we'll all be relearning everything again 5 years from now.

Java applets were supposed to provide that in the 90s. Netscape was going to actually embed Java in the browser as an alternative to JavaScript, but there wasn't enough time, so they just shipped with JS, and Java was provided via a plugin.

As a fellow backend developer I certainly hope so. I think web assembly gets us very close. Only two things I missed were direct canvas access from C++ without going through a JS shim and access to some kind of blob storage api in the browser.

Does it support:

virtualization (showing only a view of full dataset that is updated from server on scroll)

Pivoting (multiple levels/hierarchies on rows and on columns)

Apart from features what part did you consider was the biggest hurdle? What browsers did you target?

It supports virtualization, and it is utilized by both plugins (Hypergrid and Highcharts).

N levels of pivoting on both axes.

Should work on all browsers - if not, please open an issue!
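As a rough illustration of what "N levels of pivoting on both axes" means, the following Python sketch sums a value for every combination of row path and column path, including every shorter prefix of each path. The `pivot` function and the sample column names are hypothetical and are not Perspective's engine API.

```python
from collections import defaultdict


def pivot(rows, row_pivots, column_pivots, value):
    """Sum `value` for every (row path, column path) pair.

    Each path is a tuple with one entry per pivot level. Totals are also kept
    for every shorter prefix, so a pair of empty paths is the grand total.
    """
    cells = defaultdict(float)
    for row in rows:
        row_path = tuple(row[name] for name in row_pivots)
        col_path = tuple(row[name] for name in column_pivots)
        for i in range(len(row_path) + 1):
            for j in range(len(col_path) + 1):
                cells[(row_path[:i], col_path[:j])] += row[value]
    return dict(cells)


rows = [
    {"region": "East", "state": "NY", "year": 2017, "sales": 10.0},
    {"region": "East", "state": "NJ", "year": 2017, "sales": 4.0},
    {"region": "East", "state": "NY", "year": 2018, "sales": 6.0},
    {"region": "West", "state": "CA", "year": 2018, "sales": 8.0},
]
result = pivot(rows, ["region", "state"], ["year"], "sales")
```

Here `result[(("East",), (2017,))]` is the East subtotal for 2017, and `result[(("East", "NY"), (2018,))]` is one leaf cell. Looking up a longer row path is what expanding (drilling into) a node amounts to in this sketch.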

By virtualization I meant also not loading entire dataset to memory, but being able to work only on a constrained view of the data. For example, database has 1TB dataset, I only view 100 MB, further columns/sections are loaded when I scroll.

It is virtual in the sense that it does not realize the entire dataset on the JS side of the WebAssembly bridge - this is a performance optimization but entirely in-memory. While you can run the engine in node.js and use it to efficiently stream updates to a symmetric engine in the browser, we do not currently implement server-virtualized views.

Though, we have quite a lot of experience doing this in the past - and the design of Perspective is very much a reaction to what we learned, at least in regards to the typical financial dataset which is much smaller than 1TB.
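The kind of virtualization described here (the full dataset stays in engine memory and only the visible window is materialized for the renderer) might look roughly like the Python sketch below. The `Table` class and its `window` method are hypothetical names used for illustration only.

```python
class Table:
    """Holds the full dataset in memory on the engine side (hypothetical sketch)."""

    def __init__(self, columns):
        self.columns = columns   # column name -> list of values

    def num_rows(self):
        return len(next(iter(self.columns.values()), []))

    def window(self, start_row, end_row):
        """Materialize only rows [start_row, end_row) as records for the renderer."""
        end_row = min(end_row, self.num_rows())
        names = list(self.columns)
        return [
            {name: self.columns[name][i] for name in names}
            for i in range(start_row, end_row)
        ]


table = Table({
    "id": list(range(100_000)),
    "price": [i * 0.5 for i in range(100_000)],
})
visible = table.window(100, 103)   # what a grid would request for its viewport
```

Only three records cross over to the rendering side here, even though all 100,000 rows live in the table. This is different from server-side virtualization, where the rows outside the window would not be loaded at all.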

Is there a way to drill down a node dynamically, by clicking on row header during pivoting, and then see its children?

Not through the UI yet, only in the engine API - but the pivoting itself is quite fast, so it should still be suitable for drilling down "on the fly", so to speak.

What was the best/worst part about getting it to work in WebAssembly?

Best part was the validation of a design decision of having minimal dependencies. It made the web assembly port almost trivial.

Worst part (which has improved dramatically) was the rough edges in the web assembly toolchains early on.

Is Emscripten the only game in town for WebAssembly C/C++ toolchains, or are there other contenders?

Even outside of C/C++, Emscripten seems to be required in some fashion for just about every "compile X to webasm". I'm not that happy about this, as it's quite heavy and takes a long time to compile, which makes it harder for newcomers to try it out because it requires a fairly large time investment to even compile hello world.

> Even outside of C/C++, Emscripten seems to be required in some fashion for just about every "compile X to webasm".

One exception to this is Rust's new "wasm32-unknown-unknown" target, which uses LLVM to directly generate wasm files without going through emscripten: https://www.hellorust.com/news/native-wasm-target.html

The easiest way to use it for larger projects is probably via https://github.com/koute/cargo-web, which handles a lot of packaging very nicely.

It's really nice to use, assuming you already know Rust.

Yeah I'm super excited about Rust's new webasm compiler, as well as some new-ish stuff with .Net that works without Emscripten.

I get the ecosystem is super young, but it was honestly what kept me from playing with Webasm for a while, because each time I'd sit down to play with it I'd be in for a 3 hour compile after spending an hour getting my windows machine to have the right compilers and not stomping over other things I have setup on here.

Sounds like you were seeing an old emscripten SDK bug, where it compiled LLVM+clang unnecessarily and with debug info (which is very slow).

Currently there shouldn't be anything like that, it will download a binary build of LLVM+clang for your system. It should be ready to use immediately after that download.

Rust’s wasm32-unknown-unknown target does not use emscripten at all. The tool cargo-web will handle the js module shims for you.

cargo-web uses emscripten under the hood.

No, he is right. cargo-web supports three backends, of which only two include emscripten; the third uses a native Rust backend. https://github.com/koute/cargo-web#features

Rust can be compiled to wasm without emscripten using wasm32-unknown-unknown.

Does that mean I can install cargo-web without emscripten or do I still need it even though I am not going to use it just because the two other targets use it? In other words, is emscripten an optional or a hard dependency?

It's an optional dependency. You can build without it by just using a nightly version of Rust with the wasm32-unknown-unknown backend.

An example of using cargo-web with the stdweb rust project: https://github.com/koute/stdweb#running-the-examples

As far as I know, yes: for C/C++ this was the only viable toolchain that is publicly available.

I see it's possible to convert to Numpy arrays. Is `perspective` also a Python library?

Yes for much of its life perspective has been a python library. It offers a streaming dataframe abstraction in Python. I believe the Python bindings have not been open sourced yet.

That sounds super useful. Does it create its own window as a Python library or am I missing something?

The Python bits are unreleased and the python usage is restricted to desktop visualizations, so no you are not missing anything.

Can you say how old is this codebase and what was its original business case?

Codebase started in 2013. Original business case was to allow business users to create filtered/aggregated views on top of streaming data.

What's on the roadmap?


Offtopic :)

Really cool to see non-software companies putting cool projects like this out as open source

the stock market is digital, they are a software company.

Having worked at a branch off a university devoted to its online school and content marketing where the inputs and outputs of production are entirely digital, I can say they sure as hell aren't a software company; nobody above on-the-ground grunts thinks much about even the big pictures of domain modeling / workflows, data ingestion / transformation, computing / efficiency / performance, software development lifecycle / product portfolio management / QA, growth hacking / optimization / testing etc. because they simply aren't interested in the slightest. I'm sure plenty of organizations with at least one foot in pre-internet industries are in the same boat, and it's exciting to see counterexamples.

I think it is more correct to say that a large subunit of JPM works as a software company -- for its internal use. All modern financial companies are heavily tech oriented, even those you wouldn't normally think of that way. The best tech firms I've ever been involved with were financial firms, from front desk trading on back. One of the best tech teams I've ever met with has been one of Bloomberg's teams dealing with their internal core services and bond platform (I was totally floored at how amazing these guys were). These companies are too large to say they are entirely software, but there are definitely Directors and CTOs that live and breathe all those things you mention.

Sure yes, in modern business environments, everyone's business interacts at least in some way with computing, digital products like media and software, and the internet. Even if you had a remote industrial site cutoff from the wider internet, there would be some automation to be done and some code to do it. Every organization should have a CTO, if nothing else than to make procurement and outsourcing decisions. Organizations shouldn't outsource that which differentiates them, and there's lots of room for differentiation in the quality and degree of automation, thus internal software teams. Etc.

All I'm getting at is that this present set of truths, that reality, hasn't spread everywhere yet, and it's still worth celebrating when it has and when it goes well.

There are plenty of people given plenty of power in leadership that won't bring themselves to understand technology and have yet to retire.

You would be pleasantly surprised, then, that this reality has already permeated the modern financial industry. For instance, see http://www.businessinsider.com/goldman-sachs-wants-to-become... . While there are still “relationship bankers” in expensive suits, most acknowledge that software is key to giving them an edge over the competition, and that building a culture around software only helps them.

J.P.Morgan is not "the stock market" and financial companies are not software companies. They are not selling software. They are selling financial services.


They consider themselves a banking company. It's about the culture, so it is indeed refreshing to see non-traditional tech companies open sourcing things.

JPMorgan IT department must be larger than the majority of SW companies out there

"With 40,000 technologists across 14 technology hubs around the world, there are endless opportunities to create what’s next"


This is great. Finance people especially love this sort of thing - they do want to browse through entire datasets vs seeing higher level metrics in a lot of cases.

> Finance people especially love this sort of thing - they do want to browse through entire datasets vs seeing higher level metrics in a lot of cases

To expand on this, a lot of finance is searching out and understanding, normalizing or exploiting edge cases.

This is why Excel is king. Just give me a grid with all the data and a pivot table.

Except when Excel won't manage nearly all the data. I'll sometimes use Igor Pro† before coding something up however.

† (only because I'm familiar with it - I'm positive there are better GUIs)

I love seeing new UX on Web Assembly. Someone needs to port or create a standard layout engine like WPF to build on this. If the tools were there, C# + VS Code + solid layout engine...I'd dump JS + HTML + CSS in a "flash".

There are some attempts in that direction.


Sure. I've played with Blazor. It's very raw at this point and just a side project. I'd like to see MS Research or even a full blown R&D team at MS go full force at this.

Microsoft just announced that they will start investing more into it, depending how the community takes it.


My question is: how do game engines do styling?

"J.P. Morgan open sources..." -- that's so cool

Amazing, despite the fact that some heavy-lifting is done by Highcharts and Hypergrid.

What's missing for this to effectively kill all typical BI tools like PowerBI, Tableau?

Expensive sales teams. Also neither of those handles streaming data.

The backend.

@dman Cool, but I'm curious why WebAssembly was chosen for this project. What was the value of using it here versus just using the Web API?

The following features made wasm compelling here:

a. Filter / Pivot / Aggregate moving clientside means a lot of load off the servers.

b. Performance in wasm is surprisingly close to the same C++ code running natively on the desktop.

c. Ability to reuse existing C++ codebase and use it to build both native and web apps.

The C++ probably wasn't written for a browser. More likely it was written for a native frontend and has been ported to the browser via WebAssembly.

What do you mean when you say the Web API?

Presumably, the point of using WASM and service workers is for performance, but I see no benchmarks. I also have trouble imagining that this actually improves performance, unless you’re doing all of your compute in the browser and it’s CPU bound (this seems like bad design).

What is the performance of doing it this way vs. without WASM and workers?

Until we see numbers, this smells like a recruiting play.

For ticking data being pushed into bespoke pivots, running this on the backend and, say, pushing renders to the frontend isn't generally better. You're not likely to be sharing much of the grunt. Also, latency will be an issue so you need to be pretty careful with GC runtimes that aren't optimised for latency.

> What is the performance of doing it this way vs. without WASM and workers?

@dman says in another thread that the original was in Python and this was ~80x faster. If this met their performance goals, it does the job. It may be possible that another way is better, for some definition of better, but who cares if it works as required?

> Until we see numbers, this smells like a recruiting play.

It's presumably OK not to use it if you don't want to.

And what if they think this may help recruiting? Seems like a reasonable trade to me.

We use Web Workers principally to separate data CPU load from rendering load, as the datasets we deal with update very frequently and are quite large, and e.g. Highcharts can take on the order of 100ms+ on detailed charts.

We have some light benchmarks we use for regression testing, but definitely need more work in this area


First, let me say this looks awesome! I’m curious if you considered React.js for the presentation layer (i.e. the js/html5 parts) - on the surface it seems like it would be a good fit - and if so, what you saw as the benefits or drawbacks of it vs the approach you went with.

We chose to go with a Web Components based interface for compatibility across frameworks, but a lot of where we go in the future will be determined by the expertise of the developers we hire to work on it, and the community if there is interest there.

First of all, great work everyone involved! I know first-hand how hard it is to push something like this past compliance ;)

Going with WC is a sound decision since it's trivial to wrap it into ember, react, or any other framework-du-jour... especially so since more and more of them are adopting WC patterns and design practices.

I might try to wrap this into react and will see if bosses permit opensourcing it!

I hadn’t realized you were using Web Components. That makes sense. Thank you.

Will defer to texodus on that - he is the front end expert here.

For those interested, keep in mind the only way to use WebAssembly with CSP in Chrome is by turning on 'unsafe-eval'. FF/Edge/Safari all at least support compilation from URLs with more locked down policies

True - worth pointing out that this library gracefully falls back to ASM.js if your browser does not support WebAssembly, though.

Does it fallback if Window.WebAssembly doesn't exist, or does it fallback if WebAssembly.instantiateStreaming fails?

The former. In practice, it seems some runtimes are still a bit buggy - so we also revert to ASM.js if e.g. you are on an iPhone

This is seriously cool. Unfortunately I have no use-case for it in the product I currently work on, but still fascinating to flick through the source code and play with it.

really cool. would love to see a Rust version.

build it

What are you using for UI, any framework (Polymer, D3) or just plain WebComponent?

Plain web components. The UI is quite light in terms of functionality so far.

Does the C++ code use vectorization? If not, is it planned?

So, what is there about visualisation?

I read your docs, and found nothing about how to actually render stuff to the page.

I followed the directions to the letter, and everything installed correctly on MacOS. When I tried the same on Linux, though, I got tons of the following errors on the build step (keep in mind I do have boost development libraries installed in /usr/include):

(Sorry about the formatting.... I tried <code>). Anyway, TL/DR it cannot find boost.

$ ./node_modules/.bin/lerna run start --stream
lerna info version 2.8.0
@jpmorganchase/perspective: > @jpmorganchase/perspective@0.1.1 start /home/dj/usr/src/perspective-clone/packages/perspective
@jpmorganchase/perspective: > npm run compile && (npm run perspective & npm run compile_test & npm run compile_node & wait)
@jpmorganchase/perspective: > @jpmorganchase/perspective@0.1.1 compile /home/dj/usr/src/perspective-clone/packages/perspective
@jpmorganchase/perspective: > mkdir -p build build/wasm_async build/wasm_sync build/asmjs && (cd build/; emcmake cmake ../; emmake make -j8; cd ..)
@jpmorganchase/perspective: -- Configuring done
@jpmorganchase/perspective: -- Generating done
@jpmorganchase/perspective: -- Build files have been written to: /home/dj/usr/src/perspective-clone/packages/perspective/build
@jpmorganchase/perspective: Scanning dependencies of target psp
@jpmorganchase/perspective: [ 1%] Building CXX object CMakeFiles/psp.dir/src/cpp/base_impl_win.cpp.o
@jpmorganchase/perspective: [ 2%] Building CXX object CMakeFiles/psp.dir/src/cpp/arg_sort.cpp.o
@jpmorganchase/perspective: [ 4%] Building CXX object CMakeFiles/psp.dir/src/cpp/calc_agg_dtype.cpp.o
@jpmorganchase/perspective: [ 5%] Building CXX object CMakeFiles/psp.dir/src/cpp/aggspec.cpp.o
@jpmorganchase/perspective: [ 8%] Building CXX object CMakeFiles/psp.dir/src/cpp/aggregate.cpp.o
@jpmorganchase/perspective: [ 8%] Building CXX object CMakeFiles/psp.dir/src/cpp/base.cpp.o
@jpmorganchase/perspective: [ 9%] Building CXX object CMakeFiles/psp.dir/src/cpp/base_impl_linux.cpp.o
@jpmorganchase/perspective: [ 10%] Building CXX object CMakeFiles/psp.dir/src/cpp/build_filter.cpp.o
@jpmorganchase/perspective: [ 12%] Building CXX object CMakeFiles/psp.dir/src/cpp/column.cpp.o
@jpmorganchase/perspective: In file included from /home/dj/usr/src/perspective-clone/packages/perspective/src/cpp/calc_agg_dtype.cpp:11:
@jpmorganchase/perspective: In file included from /home/dj/usr/src/perspective-clone/packages/perspective/src/include/perspective/calc_agg_dtype.h:12:
@jpmorganchase/perspective: In file included from /home/dj/usr/src/perspective-clone/packages/perspective/src/include/perspective/schema.h:13:
@jpmorganchase/perspective: /home/dj/usr/src/perspective-clone/packages/perspective/src/include/perspective/base.h:29:10: fatal error: 'boost/unordered_map.hpp' file not found
@jpmorganchase/perspective: #include <boost/unordered_map.hpp>
 ^~~~~~~~~~~~~~~~~~~~~~~~~
</code>

If anyone finds this project exciting and is interested in learning more about working on Open Source at J.P.Morgan, feel free to send me a message - we are always looking to hire experienced, passionate talent!

+1 - Even though I do not work there anymore, I can vouch for the fact that the team has amazing engineers and works on fundamental CS problems.

I would like to work on Cakeshop - how do I get in touch with you?

Ah right - texodusmedia at gmail dot com should work for now.

I don't know the Cakeshop developers personally, but it looks slick!

For the blockchain projects, please email quorum_info@jpmorgan.com

Hey man, two thumbs up on this. Wall street as a contributor to open source. Who do I thank? Trump?

Poor naming choice. Google Perspective has been around for a year or two I believe.

In that case, it's actually a good naming choice, as this existed (in a closed source form) for much longer than that.

I'm not knocking the code...I think it's brilliant. I just think you're going to cross wires with Google here. Will perspective mean "filtering troll comments" or will it mean "displaying analytics in web assembly" is an unfortunate set of choices.

The name had been in place internally since 2013, so the naming ship had sailed a long time ago.

Given appropriate context, I find it hard to see someone confusing the two.

You'd be surprised how many naming collisions we have. I assume most teams that code-name their software hit the same issue when they go to open source.

Go was already a programming language when Google steamrollered the name!

It's open sourced from an investment bank, so you'll never see it updated again. Beware of integrating it into your projects.

Good on whoever took on legal to get this out the door though. Hope you make it to MD or whatever it is you were gunning for.

We very much plan to continue developing this entirely in the open! It is reputationally important we do not just dump our dead projects on the internet as Open Source, which is one of the reasons why we chose Perspective for this project in the first place.

You can see the stats for yourself. It's under active development.
