Stencila: An open source office suite for reproducible research (stenci.la)
256 points by muraiki on July 13, 2018 | 49 comments



> The calls for research to be transparent and reproducible have never been louder. But today's tools for reproducible research can be intimidating - especially if you're not a coder.

As someone (a software engineer) who has been trying (struggling) to reproduce biology research lately, I say amen. Hallelujah.

But. It's time to accept coding as a core skill. Science has more to learn from software engineering than it realizes. Software engineering (aka coding) eats reproducibility for breakfast, even when hundreds or thousands of "collaborators" are involved. These days, it's rare for a single biology researcher to produce (publish) code that is easily reproducible by an external researcher.


It's a bit more complex than that, as it's not just having access to the software, but to the whole environment. The move towards reproducible builds and configurations (Nix and the like) is a good thing for this, as well.
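To make "the whole environment" a little more concrete, here is a minimal Python sketch that records the interpreter, the platform, and the installed package versions, then hashes the result so two machines can be compared at a glance. It is far weaker than what Nix provides (it sees nothing outside Python), and the function names `environment_snapshot` and `fingerprint` are hypothetical, chosen only for this illustration.

```python
import hashlib
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, OS, and installed-package versions."""
    packages = sorted(
        {
            (str(dist.metadata.get("Name", "unknown")), str(dist.version))
            for dist in metadata.distributions()
        }
    )
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": [f"{name}=={version}" for name, version in packages],
    }


def fingerprint(snapshot):
    """Hash the snapshot so two environments can be compared quickly."""
    canonical = json.dumps(snapshot, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


snapshot = environment_snapshot()
digest = fingerprint(snapshot)
print(digest)
```

Publishing a snapshot like this next to the analysis does not rebuild the environment, but it at least tells a later reader what it was.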


I've come across this problem before (as a third party - I don't do research) and I've considered writing software to help solve it. Maybe something like a Jupyter notebook with an attached mini-filesystem that can be easily shared with colleagues? What do you think would be required to solve this?


If you're really interested in this, you should contact Ivo Jimenez and Felix Z. Hoffmann. Although there's not much information available, under the heading of "Reproducible Computational Research – a case study" at [0], they tentatively explored such an idea at a hackathon in Cambridge in May.

[0] https://elifesciences.org/labs/bdd4c9aa/elife-innovation-spr...


It depends a lot on the field. In my own work (I'm out of academia now, but it still applies), there are often very complex network setups involved, so a Jupyter notebook won't quite cut it. Many things in this particular domain can nowadays be solved with SDN, but it is still a complicated issue. I'm sure people from other fields have analogous problems, too. And this is in CS; I suspect that anything which implies a more "physical" setup is orders of magnitude more complex.


I'm baffled how "coding" equates to software engineering. If anything, scientists probably need to listen less to current Research Software Engineering dogma, and RSE types need to learn more about science. (Few will even take measurement seriously in my experience.)

The "reproducibility" mantra is at odds with a lot of real world science and serious computing. You don't/can't in general reproduce the sort of physics experiments and large-scale calculations with which I'm most familiar the way people are suggesting, and software engineering can't address bad science or lack of information. Revision control and "notebook" interfaces seem to have become the equivalent of waving XML metadata at any problem from the days when e-science was preventing useful work and research. Experience from a non-trivial research record and some decades doing and supporting research computing will be ignored, though.


The issue is not to repeat the experiments, but rather to avoid putting spokes in the wheels of people who want to re-analyze your existing data (including yourself a couple of years later) and avoid losing thousands of working hours doing forensic data analysis to point out shitty science. Keith Baggerly's 2010 talk on the hoops he had to jump through to get to Anil Potti (https://en.wikipedia.org/wiki/Anil_Potti) is a great demo of the application case: https://youtu.be/7gYIs7uYbMo.

And as for "doing real science" vs. trying to make it more reproducible, there is an excellent analogy with "doing real programming" (aka adding features) vs. refactoring and architectural adjustments. Saying that you consider the second a waste of time says more about you than about the subject.


This looks interesting. I'm actually rather surprised at the lack of quality in most open source 'office' alternatives. I'm not sure why, but open/libreoffice is almost unusable for heavy duty office work. Microsoft products are the clear winners if you require productivity and stability. I really hope that changes soon, because I feel it's a major factor that holds back widespread Linux adoption.


There's a ton of mindnumbing testing and grinding work that honestly no one really wants to do, involving bit-level accuracy in many cases. We (https://sheetjs.com/) are attacking the file compatibility problem (our major open source project is https://github.com/sheetjs/js-xlsx, used in stencila's spreadsheet converters) and it is painfully obvious that OO/LO cuts corners at certain places. For example, the mathematics of Calc is intentionally bugged:

> ignore the last two bits for many stuff to improve the user experience.

https://bugs.documentfoundation.org/show_bug.cgi?id=83511

A successful alternative to Microsoft Office has to start from a level of compatibility that no current developer in existing solutions has expressed interest in attaining.


Since you referenced that bug, and in it also gave an example involving fractions and =0.1+0.2, you would surely also agree that the binary floating-point result of =0.1+0.2-0.3 is not 0.0. Still, Excel displays 0 as the result (and so does Calc) instead of 5.551115123125783E-17, because users expect that.

Your bit-level accuracy approach isn't as simple when it comes to user experience and Excel compatibility.
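The trade-off is easy to reproduce outside a spreadsheet. A minimal Python sketch: `raw_sum` adds IEEE 754 doubles left to right, while `display_sum` snaps a result to zero when it is negligible next to the largest operand. The snapping rule and the `rel_tol` value are assumptions made for illustration; they are not the documented behaviour of Excel or Calc.

```python
def raw_sum(*terms):
    """Add the terms left to right with plain double-precision arithmetic."""
    total = 0.0
    for term in terms:
        total += term
    return total


def display_sum(*terms, rel_tol=1e-15):
    """Like raw_sum, but treat a result that is tiny relative to the
    largest operand as exactly zero (an illustrative rule only)."""
    total = raw_sum(*terms)
    scale = max((abs(term) for term in terms), default=0.0)
    if abs(total) <= rel_tol * scale:
        return 0.0
    return total


raw = raw_sum(0.1, 0.2, -0.3)
shown = display_sum(0.1, 0.2, -0.3)
print(raw)    # 5.551115123125783e-17
print(shown)  # 0.0
```

Both answers are defensible: the first is bit-accurate, the second is what most users expect to see.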


Regarding your fraction example there:

=RAWSUBTRACT(0.1,-0.2,1/3) => -0.033333333333333

=RAWSUBTRACT(0.1,-0.2,2/7) => 0.014285714285714

So which one is closer to 0.3?
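For reference, the question can be answered exactly with rational arithmetic. This sketch uses Python's `fractions` module and reads RAWSUBTRACT as plain left-to-right subtraction, so 0.1 - (-0.2) becomes the target 3/10; the variable names are hypothetical.

```python
from fractions import Fraction

# 0.1 - (-0.2), held exactly as a rational number instead of a double.
target = Fraction(1, 10) + Fraction(2, 10)

candidates = {"1/3": Fraction(1, 3), "2/7": Fraction(2, 7)}

# Exact difference between the target and each candidate fraction.
errors = {label: target - value for label, value in candidates.items()}
closest = min(errors, key=lambda label: abs(errors[label]))

for label, error in errors.items():
    print(label, error, float(error))
print("closest to 3/10:", closest)
```

The exact differences are -1/30 and 1/70, which agree with the two decimal results quoted above, so 2/7 is the nearer of the two.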


Just a quick word of thanks for SheetJS from a Stencila dev! All the "mindnumbing testing and grinding work that honestly no one really wants to do" is obvious and much appreciated. Your efforts are so important for ensuring interoperability but are often not recognised because they are "in the backend".


> it is painfully obvious that OO/LO cuts corners at certain places. For example, the mathematics of Calc is intentionally bugged

What about Gnumeric?

My understanding is that the dev team (mainly Jody Goldberg really) have focused on accuracy. The gnumeric.org site has links to studies related to its accuracy.


Interesting. In your experience, do you think heavy duty, in-browser office applications are now viable? I'm beginning to think they are, especially with the likes of web assembly becoming more ubiquitous.


> I'm not sure why, but open/libreoffice is almost unusable for heavy duty office work.

What do you define as "heavy duty office work"? What would make OpenOffice more suitable to your needs? Asking as an AOO contributor. I'm genuinely curious to know what you think the project is lacking or how it needs to be improved.


Firstly, I would like to thank you for your work. OpenOffice has served me very well during my school and university days.

Since joining the world of work, I've struggled with the following:

- Random crashes every couple of hours when files become large. Especially in Impress and Calc.

- I do a lot of diagrams in OODraw. Sometimes I'll reopen the file to find arrows and connectors have moved.

- The interface could be more intuitive. For example, the colour selector is limited and clunky, as is the gradients widget. Why can't we add any colour we like without going to the options menu?

- Application speed is slow when files get large, especially Calc. I've even tried increasing memory per object.

- Filtering and sorting in Calc not as fully featured or easy to use as Excel.

- Conditional formatting not as easy to use.

- Calc shortcuts not as easy to use as Excel (eg, I don't think there's an easy way to select a column without a mouse, or transpose a selection, or remove blanks etc etc). With Excel, I can pretty much achieve most things without touching a mouse (I'm an ex investment banker, and we get pretty familiar with the keys). OO seems to lack these critical shortcuts.

- Poor documentation of OO scripting environment. It's tough to figure out how to automate simple things in Calc.

This is not an exhaustive list, but it's all I can think of currently.

Again, I am very supportive of the AOO/LO effort, but I wished it would start giving Microsoft more competition in the power user category.

Thanks!


> investment banker

As someone who was a software engineer in a brokerage and had to deal with clients' Excel sheets: open/libre office not letting you do that is a feature.

We would be given Excel sheets that depended on a specific version of Excel. We had sheets in the hundreds of megabytes. We had sheets that would take overnight to run. We had sheets with sheet dependencies. We had sheets that needed to be run in a specific order manually. We had sheets with 13,000 manually entered rows, each with 53 columns.

These things, and I use the term thing since eldritch abominations does not convey the horror of using them, were responsible for investing hundreds of millions of dollars.

The day Excel dies is a day I will celebrate.
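The "run in a specific order manually" part is, at bottom, a dependency-ordering problem that a few lines of code can take over. A minimal sketch using Python's standard `graphlib` module (Python 3.9+); the sheet names and their dependencies are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical workbook: each sheet maps to the sheets it depends on.
dependencies = {
    "report": {"pricing", "positions"},
    "pricing": {"market_data"},
    "positions": {"market_data", "trades"},
    "market_data": set(),
    "trades": set(),
}

# static_order() yields every sheet after all of its dependencies and
# raises graphlib.CycleError if the sheets depend on each other in a loop.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Once the order is computed rather than remembered, nobody has to know by heart that `positions` must be recalculated before `report`.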


Before Python, before Perl, before I grew up and got a real job, I worked in a lab that relied upon the Ingres spreadsheet application to reduce experimental data. At the time, it was a Macintosh-only application, and its scripting support was actually very good, but validating the data was a very slow process and ran overnight.

I re-wrote the process in AWK, and was able to complete the data-reduction task in about 60 seconds on a DEC 5000 workstation. But my numbers did not match the spreadsheet results, and finding out why was an interesting process.

I then encountered small, remote offices of big international firms -- usually real-estate, insurance companies -- that had limited local IT support. It was Excel macros all the way down.

And then a large biotech firm, where it took KPMG analysts a few months to determine that our entire business relied upon this one guy's spreadsheet.

So yeah.

But:

Excel dies, let us suppose... What will these people come up with? Alexa queries?

I have preferences other than Python, but mostly I can read code written by the scientists who wrote it. It may be full of abominations, like global state variables and magic numbers, but it is readable.

Is Python any reason to hope for a better world?

(Hmm. Off-topic, perhaps. Your story woke some memories of tough IT experiences.)


> Excel dies, let us suppose... What will these people come up with?

Ideally, excel with separated actual use cases, which the original app merges into one thing:

- data entry into tables

- auto-generated tables over some ranges (dates, counts)

- constants / described values (they end up on a side, sometimes in a named cell if you're lucky)

- presentation/report layer

I think Access did some good things, but is too close to being a database to be comfortable. BI too is great for the last part, but it's a separate/expensive product.
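As a rough sketch of what keeping those four concerns apart could look like in code (all names and numbers are hypothetical, and amounts are integer cents to sidestep floating-point rounding):

```python
from dataclasses import dataclass
from datetime import date, timedelta


# 1. Data entry: rows typed in by a person, nothing derived.
@dataclass(frozen=True)
class Entry:
    day: date
    amount_cents: int


# 2. Auto-generated table: a range of dates computed, never typed.
def date_range(start, days):
    return [start + timedelta(days=offset) for offset in range(days)]


# 3. Constants: named and described in one place.
CONSTANTS = {"fee_basis_points": 200}  # a 2% fee


# 4. Presentation layer: derived from the three pieces above.
def report(entries, days, constants):
    totals = {day: 0 for day in days}
    for entry in entries:
        if entry.day in totals:
            totals[entry.day] += entry.amount_cents
    return [
        (day.isoformat(), total, total * constants["fee_basis_points"] // 10_000)
        for day, total in totals.items()
    ]


entries = [
    Entry(date(2018, 7, 13), 10_000),
    Entry(date(2018, 7, 13), 5_000),
    Entry(date(2018, 7, 14), 2_500),
]
rows = report(entries, date_range(date(2018, 7, 13), 3), CONSTANTS)
for row in rows:
    print(row)
```

Each layer can be checked or replaced without touching the others, which a single merged sheet does not allow.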


Yes, I too have seen those spreadsheets, but it's not Excel that's at fault, it's the users. I've taken XLS behemoths and trimmed them down to a fraction of their size and optimised sheets that ran overnight, to recalculate in seconds. I get what you're saying about Excel giving too much power in incapable hands, but I feel, for certain use cases, OO is too little power in even the most capable hands.


I keep saying this to my friends: I dream of the day when people exchange SQLite files instead of XLS(X), and Excel, and the like, are just an interface to that database. This way you get data integrity (Excel randomly truncating digits or converting numbers to scientific notation is the bane of my existence), data present in raw form, and being interface/view layer agnostic.
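A minimal sketch of the idea with Python's built-in `sqlite3` module. The table layout and the sample values are hypothetical; the point is only that a declared column type and a CHECK constraint keep an identifier and a small number exactly as entered, and reject a bad row instead of silently reformatting it.

```python
import sqlite3


def open_dataset():
    """Create an in-memory database with a typed, constrained table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE measurements (
            sample_id TEXT PRIMARY KEY,
            label     TEXT NOT NULL,
            reading   REAL NOT NULL CHECK (reading >= 0)
        )
        """
    )
    return conn


conn = open_dataset()
row = ("00012345678901234567890", "MARCH1", 0.000000123)
conn.execute("INSERT INTO measurements VALUES (?, ?, ?)", row)
stored = conn.execute(
    "SELECT sample_id, label, reading FROM measurements"
).fetchone()
print(stored)

# A row that violates the CHECK constraint is refused outright.
try:
    conn.execute("INSERT INTO measurements VALUES (?, ?, ?)", ("x", "SEPT2", -1.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("negative reading rejected:", rejected)
```

A spreadsheet front end could then be one of several views onto the same file rather than the place where the data lives.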


> - I do a lot of diagrams in OODraw. Sometimes I'll reopen the file to find arrows and connectors have moved.

Interesting. I use Draw a lot and haven't had that hit me yet. Any chance I could convince you to post something about this to the AOO mailing list, or open an issue in Jira about it? (that is, assuming there isn't already an issue for this).

> Poor documentation of OO scripting environment. It's tough to figure out how to automate simple things in Calc.

Agreed. I just went through that exercise myself, when I was working on a thing at work to use the Jira API to query stories, and pull stuff into a spreadsheet for analysis. I was able to figure it out, but it wasn't easy. From what I've seen, the information is all there, but it's not necessarily organized / accessible enough. There's also not enough tutorial format stuff. Hopefully I can write up some stuff based on my recent experience and get it out there.

> but I wished it would start giving Microsoft more competition in the power user category.

Same here. I have a laundry list of things I'd like to try and add to Calc to make it more powerful for complex analytics and what-not, but I am so busy on other projects I haven't had time to pursue those ideas much.


For the last few years most of the work has been happening in the LibreOffice fork. If OO crashes on you, try LO.

https://lwn.net/Articles/699755/


I use it almost exclusively.

And it crashes. It crashes and loses your work.

Regularly.

I love that such a product exists. I donate to LibreOffice.

But it's nowhere near as reliable as MS Office. Any complex enough document will make it crash at some point.

It's not even reproducible. You can't pinpoint the action that made it crash. Sometimes you do the same thing over and over and it's fine. Then it destroys your work and crushes your soul.


Not disagreeing with you, as everyone has different workflows etc. but my experience is really different. I've found LibreOffice entirely suitable for heavy duty work, completely fine.

Microsoft products crash inexplicably on our systems too, and in fact, we have much weirder problems with corrupted instances and things than we've ever had when using LibreOffice. I also have serious problems with the Office UI, basically that the ribbon UI as it's implemented is inconsistent and incoherent, which is frustrating as hell. If you're going to have a ribbon/tab UI, make it consistent throughout, and don't have special exceptions for some functions.

Having said that, I do think LibreOffice loses to Microsoft's edge in some areas of polish, like in implementing equations, and in drawing figures. Some of the UI with the drawing actions are much more intuitive in Office.


Try softmaker office


Not sure why that page does not include a link (that I can find anyway) to this intro video that the creator, Michael Aufreiter, put together:

https://www.youtube.com/watch?v=EzrR96PDnO8


Interesting!

Does Stencila offer any kind of author collaboration? Publications are usually not one-person efforts and research teams oftentimes are not working in the same location.

In my experience the GDocs or Word comments and revise mode are heavily used in collaborations.


Thanks for the feedback. Comments and "track changes" are a commonly requested feature, important for collaboration, and certainly something we want to implement!


I was wondering that too. It seems like there might be collaborative editing, given the Dat project's involvement?


Yes, we'd love to have real-time collaboration, preferably in a decentralised way using Dat!


tl;dr - an interactive notebook, that accepts Jupyter files, for folks who usually use MS Office.

From their FAQ:

> Stencila allows you collaborate with colleagues who use other tools than RMarkdown and Jupyter Notebook, without you having to give up your favourite tool. Stencila Coverters make it possible to open documents in various formats (R Markdown - Rmd, Jupyter Notebook - ipynb and so on) in Stencila. The conversion is lossless for all interactive parts (such as code cells).

Nice to see dat part of the conversation.
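For a sense of what "the interactive parts" of a Jupyter file are: an .ipynb file is JSON with a list of cells, each carrying a `cell_type` and a `source` that is either a string or a list of strings. The sketch below pulls out the code cells from such a structure. It says nothing about how Stencila's converters actually work; `code_cells` and the sample notebook are hypothetical.

```python
import json


def code_cells(notebook):
    """Return the source text of every code cell in a notebook dict."""
    sources = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format allows a single string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        sources.append(source)
    return sources


# A tiny hand-written notebook, round-tripped through JSON as a file would be.
sample = {
    "nbformat": 4,
    "nbformat_minor": 2,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["x = 1\n", "print(x)"],
        },
    ],
}
notebook = json.loads(json.dumps(sample))
extracted = code_cells(notebook)
print(extracted)
```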


Those coverters are really good for undercover work.


How's Stencila supposed to produce reproducible research?

No one's fudging the formulae, they're fudging data. And Stencila will digest whatever data you give it, real or fake.


Hear, hear! Even if they're not fudging them as such, they may have cocked up the measurements. You figure that out from hard experience in the field (or relevant experience from another, which is often very helpful), not with some trendy piece of software.


Has anyone else noticed this trend of integrating data science tools into one package that installs and potentially monitors you? For example this seems like another `Anaconda Navigator` clone. Or am I missing something here?


I'm surprised that there appears to be no link to the Github Repo on the homepage: https://github.com/stencila/stencila


There is a link in the "Contribute" section.


If any contributor sees this: The "Learn" link in the footer points to a 404 Github page.

Is this usable for daily usage? And can we output PDF and use LaTeX for publications?


Hi, Stencila dev here. In my opinion, Stencila is not ready for daily usage. But we are looking for beta testers who are willing to put up with bugs and crashes and help us shape the framework. We have converters, based on Pandoc, which are able to convert to both PDF and LaTeX - although they are also in preliminary development.


So this is a reproducible research application suite built on Node.JS, the epitome of non-reproducibility and characterised by fast pace, little care for compatibility, and an ecosystem of volatile libraries?


Hi, Stencila dev here, thanks for the feedback. There are several modular components that make up Stencila. The user interface modules are indeed built using Javascript, and Node.js is used in a number of places including the desktop and CLI apps and format converters. But for code execution, Stencila does not rely on Node, and users can use R, or Python or SQL.

> Reproducible research depends on reproducible execution, which depends on a reproducible environment, which depends on a reproducible set of libraries and frameworks.

Completely agree. We are trying to make it easy for people to use reproducible libraries and environments. To this end, we are developing Nix environments (a highly reproducible way of defining computing environments) which include Stencila "execution contexts" for R, Python etc with standard libraries included. These environments can be connected to the user interface. See https://github.com/stencila/images/ for more details.


Thanks for the explanations!


Also the most widely used language 6 years in a row: https://insights.stackoverflow.com/survey/2018/#most-popular...


That by itself doesn't count for much in this context. Web based technologies tend to move quickly while scientists are almost all using math libraries that were written in the '60s and '70s. To scientists these things are just tools and they prefer stability, reproducibility and minimal dependencies over using what's popular.


I doubt there’s much that can be said here to sway anyone’s mind.

Either way, wish the project luck! Reproducibility of scientific studies is paramount.


Not really. Reproducible research depends on reproducible execution, which depends on a reproducible environment, which depends on a reproducible set of libraries and frameworks. I don't think Node is apt for that. The community values new over stable, and cool over technically superior. That might have its value, but I do not believe those are valuable traits for a project of this sort.

That it's popular means nothing. Windows is the most popular operating system, pop is the most popular genre of music, oil is the most popular fuel, and all have been for decades. This doesn't invalidate their utility or our appreciation of them, but it does not mean they're good for everything.


Uh, sure (eyeroll)


So this is jupyter for "excel", right? Nice!



