I only find reproducible notebooks on the internet. It is really rare to get them from coworkers, and if they aren't trained as developers, it is almost impossible. Their solution to this problem looks efficient, simple, and brilliant:
> Writing Polynote’s code interpretation from scratch allowed us to do away with this global, mutable state. By keeping track of the variables defined in each cell, Polynote constructs the input state for a given cell based on the cells that have run above it. Making the position of a cell important in its execution semantics enforces the principle of least surprise, allowing users to read the notebook from top to bottom. It ensures reproducibility by making it far more likely that running the notebook sequentially will work.
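To make the quoted idea concrete, here is a toy model of position-based state. It is only a sketch and not Polynote's actual implementation: the names `run_notebook`, `cell_1`, `cell_2` and `cell_3` are made up for illustration, and each "cell" is assumed to be a plain function from an input state to the variables it defines.

```python
# Toy model only: each "cell" is a function from an input state (a dict) to
# the variables it defines. None of these names come from Polynote itself.

def run_notebook(cells):
    """Run cells top to bottom and return the variables defined by each one."""
    outputs = []
    for cell in cells:
        # The input state is built only from cells above; later cells win.
        input_state = {}
        for earlier in outputs:
            input_state.update(earlier)
        outputs.append(cell(input_state))
    return outputs


def cell_1(state):
    return {"x": 2}


def cell_2(state):
    return {"y": state["x"] * 10}


def cell_3(state):
    # Uses y only if some cell above actually defined it.
    return {"z": state["x"] + state.get("y", 0)}


full = run_notebook([cell_1, cell_2, cell_3])
# Deleting cell 2 and rerunning means y is simply gone: nothing lingers in a
# shared global namespace.
without_cell_2 = run_notebook([cell_1, cell_3])

print(full[-1])            # {'z': 22}
print(without_cell_2[-1])  # {'z': 2}
```

Because a cell can only see what the cells above it produced, running the list in order always gives the same answer, which is the reproducibility property the quote describes.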
Thanks for the kind feedback. It's a young project to be honest, but I'm pretty proud of what we've done with only two contributors so far. With community participation I think we could support many more languages pretty quickly!
I have always wished for reproducibility. Thanks for taking up this feature. How do you handle aliasing and references inside objects? Suppose I have
# Cell 1
a = [1, 2, 3]
b = (a, True)
# Cell 2
b[0][0] = 5
# Cell 3
print(sum(a))
Now if I change Cell 2 to
# Cell 2'
b[0][0] = 4
and execute, Cell 3's result becomes stale. Do you track such dependencies? Would really love to read more about the underlying implementation.
If you mutate an object itself, we can't really track that. There's no magic going on; you can break the state if you use mutable objects. It's less of an issue in Scala where immutable data structures are the norm, but I can imagine it would be disappointing in Python.
Currently it takes a shallow copy of the state output by each cell, meaning every value is going to be a primitive value or a reference. If it's a reference to mutable state, you're kind of on your own with respect to keeping reproducibility. I felt like this was a good compromise between strictly enforced reproducibility and practicality; if it turns out to be confusing we could consider deep copying the state, or having an option to do that (I could imagine it being pretty bad for efficiency in a lot of ML use cases, though).
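The trade-off described above can be reproduced with Python's standard `copy` module. This is only a sketch under the assumption that a cell's output state looks like a plain dict of variable names; the `snapshot` helper is hypothetical and is not how Polynote stores state.

```python
import copy


def snapshot(state, deep=False):
    """Hypothetical helper: copy a cell's output state (a dict of variables)."""
    return copy.deepcopy(state) if deep else copy.copy(state)


# Cell 1
a = [1, 2, 3]
b = (a, True)
cell_1_state = {"a": a, "b": b}

shallow = snapshot(cell_1_state)          # new dict, same list and tuple objects
deep = snapshot(cell_1_state, deep=True)  # new dict with independent copies

# Cell 2 mutates the list in place through the tuple.
b[0][0] = 5

# Cell 3 reading from each kind of snapshot of cell 1's state.
shallow_sum = sum(shallow["a"])  # 10: the in-place mutation leaked into the snapshot
deep_sum = sum(deep["a"])        # 6: the snapshot still holds cell 1's original list

# Rebinding a name (rather than mutating the object) does not affect either snapshot.
a = [100]
still_shallow_sum = sum(shallow["a"])  # 10
```

The deep copy keeps cell 1's state intact, but it has to duplicate every object reachable from the state, which is the efficiency concern mentioned above for large ML datasets or models.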
According to the article, the most interesting feature compared to Jupyter is no hidden state - if you delete a cell, the variables it set are gone. Also, you can mix languages - you'll be able to access variables filled by previously executed cells in another language.
Personally, I'm looking forward to trying out the SQL support. I haven't seen an elegant solution for SQL notebooks in Jupyter, it was always second-class via Python or some such. Or have I missed something?
Interesting. Since it seems to be implemented in a JVM language and a screenshot shows "Scala" as a supported language, I'm guessing at least all the JVM languages are supported (personally hoping for Clojure), but I can't seem to find a list of supported languages anywhere in the post or on the website.
Currently just Scala and Python (via jep). Looking to add more (probably starting with Java and Clojure) but haven't had time yet. There's just two of us working on it so far. PRs welcome!
Yep, we will be adding a plug-in to support Graal languages (little bit of learning to do on Graal first). We didn't want to make the project depend on Graal, though, because it's a pretty small segment of users (and we're not using it on our team at this point).
I don't want to over-promise anything given that I still have some reading to do about Graal's inner workings. But even given the interpreter side, there's also some plumbing to do (e.g. Monaco integration) that we haven't thought out yet. It's still in its infancy, and we'll need some help to be able to add stuff like this.
The SQL support is done through Spark, so it's not particularly novel – Zeppelin for example supports SQL similarly. We've talked about adding a more general SQL interpreter, though. Happy to hear any suggestions about it!
Do you know of any generalized SQL interpreter that allows push-downs to the underlying engine where possible, but can also allocate compute resources to post-push-down operations? For example, merging disparate result sets or making up for features the underlying engines lack.
Closest thing that comes to mind is something like Apache Drill, which coincidentally also uses Apache Calcite as the SQL interpreter.
Also wondering why I would use this over Zeppelin which can support other interpreters like Flink?
I like this as a concept, but the JDK / jep requirements are a bit of a turn off, personally... I understand they want it to speak Spark but that's not exactly how I would imagine it worked from the name or the "polyglot notebook" description
While the reproducibility problem is definitely an issue, I'm not sure it's such a big issue that I'd switch to a whole different notebook solution for it. For most notebook scenarios, running from scratch works fine to ensure it reproduces. Apart from this one feature, BeakerX does all the same things and fits a lot better into the existing Jupyter ecosystem.
To be clear, we're not out to supplant Jupyter. Anybody who's happy with their Jupyter setup will likely find little value in Polynote. But it has plugged some gaps we've had in our Scala ML research team at Netflix, so we thought others might see some value as well.
Somewhat off-topic, but what's with the lambda replacing the "n" letter? I'm no expert in Greek but I thought lambda was the equivalent to the letter "l"...
The logo was hastily designed by an amateur (me). I figured most people would figure it out, pedantic people would complain, and we'd all have a good time :)
We've had some better options contributed in the past couple of weeks, but as long as we're going to change it I didn't want to rush that. So we stuck with my questionable typographic treatment for the blog post.
See, we tried that, and to me it just looked like "ponynote". So far everyone who's mentioned "polylote" has been a current or former physicist, so maybe there's an interesting correlation there...
I did the upside-down lambda with the right-side-up lambda because I thought it had a neat yin-yang look to it. Probably should have thought about how it would read to someone who reads Greek :picardfacepalm:
It does! Monaco is one of the many awesome open source libraries that made Polynote possible. We'll be discussing that at Scale by the Bay; check out our talk if you're going!
It seems like the tool was mainly invented to deal with the issue of hidden state in notebooks, but I don't honestly see what the big deal is. Jupyter notebook is a tool with hidden state being a gotcha that you can learn how to deal with extremely quickly. I've been a Jupyter notebook user for several years so haven't had this problem often in recent memory, but I've led workshops where we teach users how to use the notebook. Inevitably hidden state issues come up, but students very quickly learn that restarting the kernel is a necessary part of the workflow and figure out when they need to do it.
It's not a successor; nteract is a separate project (part of the Jupyter ecosystem) and is alive and well. Polynote was started mainly to support use cases of our Scala-based ML engineering teams. It's a little bit apples and oranges.
I have a love-hate relationship with how RStudio deals with hidden state in notebooks. If you want to export an .rmd file to PDF, you have to run the whole notebook from start to finish in order, sorta proving that the thing is reproducible before export (maybe there is some technical reason as well).
It's nice to know that your report actually worked, and is not showing something odd because of hidden state, but sometimes you just want to print the darn thing now!
I need to resist the urge to package this as a standalone app. I don't really like the idea of running a separate server and having an editor tied to a browser, but wrapping everything in an app bundle with WebKit views seems like a nice side project.
I wish someone would make a Jupyter alternative that is native, without the need to run a web browser, CSS, and a bunch of JavaScript just for a simple rendering task. Something based on Qt, GTK, or anything native and cross-platform.
A couple of years ago I would have totally agreed with this, and I had several abortive attempts to do something like that before starting Polynote. The problem is, it turns out that a notebook really needs to be able to display a bunch of really heterogeneous rich output. There's only one pre-baked way to support that, and it's HTML. So you can either embed HTML in your UI, or embed your UI in HTML. At least on the JVM (and at least for me), it turned out to be easier to do the latter than the former.
Gave this a try and it looks very promising. It would be great if GraalVM was integrated to extend the polyglot support to JavaScript, Ruby, R, in addition to Java, Groovy, Kotlin, and Clojure.
Maybe I'm not the right audience but why would notebooks need to have no hidden global state, and be reproducible? I personally use notebooks as a way to jot down things I would've tried in a REPL. Notebooks aren't meant to hold pieces of software; they are a dump of my explorations. I have a hard time understanding some of the requirements that went into the design of Polynote.
So interesting to see something like this right after VS Code added Jupyter notebook support this past month, which I was excited to see given how poor the editing experience is in standard notebooks, especially around intelligent autocomplete.
There are lots of cool developments in IDE notebook support - IntelliJ just dropped a plug-in for it as well. To be honest I'd be thrilled if an IDE solution could fill all the gaps that Polynote's targeting (our work would be done!) so I'm looking forward to seeing what develops.
Another somewhat off-topic comment: Someone please do this for stock prices / SEC data / financial modeling, with the ability to output into PDF (or PPTX) and you will conquer the world.
No, these are outputs that the client pays us to generate for them. These specific instances that I linked were from an example that was made public and then filed with the SEC, but usually these decks remain private
This isn't exactly XBRL as the calculations are all bespoke, done for the purposes of these specific presentations
My vision is separating financial analysis from formatting, or bringing CSS / markup languages into the world of Excel and sidestepping PowerPoint entirely if possible
PowerPoint is a slide presentation tool that is sadly used for authoring so-called "books" with lots of formatting in Investment Banking, Management Consulting and Corporate America at large... and its shortcomings become immediately obvious
As we speak (!!!), I'm having to reformat a book from our standard template into the client's template by manually resizing charts, recoloring series, etc... literally wasting hours of my time because we use Excel + PPT
Client templates happen, but they aren't the norm. I pointed that example out as one egregious waste of time, but simply conforming Excel charts / tables to the company's format is very time consuming on a day-to-day basis.
A lot of people just don't have an intuitive sense for design or plain don't care, so you're left with a high variance in the quality of the work that is produced. Speaking from my banking experience, the result is senior bankers have to spend time marking up documents on things like formatting, and junior bankers feel frustrated because they waste their productive hours on non-productive work. And it's generally a frustrating experience indeed as the tools we use aren't built for this purpose.
But formatting is just part of it. "Code" reusability (more like financial model reusability) is basically zero, auditing is a pain and people use many different approaches to do the same things, but again with varying levels of efficiency and accuracy. I can tell you 10 ways to calculate Total Shareholder Returns for a public company using the FactSet add-in in Excel. Also several ways to pull Revenue and EBITDA figures, and these aren't even the more esoteric metrics like Free Cash Flow or Funds from Operations.
These decks that we prepare should work like recipes. Input the ingredients and out comes the prepared dish, but everything is so god damn manual
Not to mention issues with Excel itself, like how Excel's (non-cascading) styles and defined names (similar to variables) propagate to other workbooks if you copy-paste content from one to another, which in the long-term creates files full of garbage that crash and corrupt frequently
The whole paradigm is a shit show. One day it will be different, I am certain. But nobody has taken a holistic, multi-disciplinary view of the problem, because bankers / consultants don't know what is possible with today's technology and the ones trying to address these usability issues are only looking at a couple of pieces of the puzzle at once.
I was curious about this problem so I went digging around.
As I understand it, financial companies often want to gather data from multiple places and consolidate it in a digestible form to make financial decisions. (what kind of financial decisions, I'm unclear on, since most trading is done by computers now, but maybe this is for ETFs or OTC trades) And they're used to looking at it in PDFs and PowerPoint, because that's what people email around, and no one trusts having financial slides on the web. (why, btw? In case they leak?) And because you have many clients that want the same or similar sort of analysis on publicly traded companies, you'd ideally be able to change the analysis once, and then generate all the PDF and PPT reports they want to see.
It does seem like a giant waste of time to cut and paste data from Excel into PowerPoint by hand. However, you should be able to export Excel data to PowerPoint via the Visual Basic Editor. (https://www.wallstreetmojo.com/vba-powerpoint/) Do people not use VB to prepare this?
I don't get the impression that analysts in finance would be willing to move from Excel to a notebook. Do you get a different sense? What would a notebook have to offer to get them to switch? Analysts generally seem to love Excel, with the exception of the slow first load and crashes.
> As I understand it, financial companies often want to gather data from multiple places and consolidate it in a digestible form to make financial decisions.
Yes, this is spot on.
> (what kind of financial decisions, I'm unclear on, since most trading is done by computers now, but maybe this is for ETFs or OTC trades)
Sales & Trading is but one part of banking. Yes, a lot of S&T is automated, but long-term strategy isn't defined by computers and neither is pitching to win new businesses. Besides S&T, there's also Restructuring (advisory and financing) and Mergers & Acquisitions. My opinion is written from the perspective of an M&A banker. I probably make ~3-4 PPT books every week on average.
> And they're used to looking at it in PDFs and PowerPoint, because that's what people email around, and no one trusts having financial slides on the web. (why, btw? In case they leak?)
Concern over leaks is certainly a driver, but traceability and auditing also play a role. Three years from now, I can certainly retrieve info that was attached to an e-mail, but a link may have long expired. Also, most recipients are over the age of 40, so they might not like using links in general.
> It does seem like a giant waste of time to cut and paste data from Excel into PowerPoint by hand. However, you should be able to export Excel data to PowerPoint via the Visual Basic Editor. (https://www.wallstreetmojo.com/vba-powerpoint/) Do people not use VB to prepare this?
Certain tools like the add-ins provided by FactSet, S&P's Capital IQ and, less commonly, Bloomberg, export data into PPT with some metadata attached to it that allows you to refresh content quickly (in theory). It's all built as plugins on top of MS Office apps so your experience is not always smooth. Plus they don't solve the bigger issue of reusability
> I don't get the impression that analysts in finance would be willing to move from Excel to a notebook. Do you get a different sense? What would a notebook have to offer to get them to switch? Analysts generally seem to love Excel, with the exception of the slow first load and crashes.
I think Analysts like spreadsheets with very responsive UIs and countless hotkeys. Muscle memory is a major thing in this business.
My vision of a solution would be one that still implements spreadsheet-like functionality but does so in pieces that connect together all the way through publishing via an integrated paradigm
And to be clear, such a paradigm could sit on top of Excel. It's the tooling around it and the workflow that I think will be solved. Where exactly the boundaries of a new app / system lie is open to debate, however.
As someone with some amount of influence over the direction of Excel, I'd love to sit down with you to understand this scenario better. Would you mind if I contacted you off-HN? My contact info is in my profile. Thx!
Investment Banks, but maybe sell the functionality to a large provider like FactSet or Capital IQ who already have their packaged software licensed to big financial institutions
Alternatively build something more tightly integrated to MS Office and sell it to Microsoft
The notebooks are just .ipynb files (Jupyter's format, though apparently it doesn't like our notebooks very much...). So you can certainly store them in git. We don't have integration yet, but it's on our roadmap.
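Since .ipynb files are JSON, one common way to keep git diffs readable until there is real integration is to strip cell outputs before committing. The sketch below assumes the standard Jupyter nbformat 4 layout (`cells`, `cell_type`, `outputs`, `execution_count`); whether Polynote's files follow it exactly is an assumption, and `strip_outputs` is a hypothetical helper, not part of Polynote or Jupyter.

```python
import json


def strip_outputs(notebook_json: str) -> str:
    """Return the notebook JSON with code-cell outputs and counters removed."""
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    # Stable key order and indentation keep diffs small between commits.
    return json.dumps(nb, indent=1, sort_keys=True) + "\n"


# A made-up one-cell notebook, used only to exercise the helper.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [{
        "cell_type": "code",
        "source": ["print(1)"],
        "metadata": {},
        "execution_count": 3,
        "outputs": [{"output_type": "stream", "name": "stdout", "text": ["1\n"]}],
    }],
})

cleaned = json.loads(strip_outputs(example))
print(cleaned["cells"][0]["outputs"])  # []
```

The source of each cell is left untouched, so the committed file still reruns to the same results if the notebook is reproducible.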