- Little incentive for researchers to do this beyond their own good will.
- Most ML researchers are bad writers, and it's unlikely that the editing team will do the work needed (which is often a larger reorganization of a paper and ideas) to improve clarity.
- Producing great writing and clear, interactive figures, and managing an ongoing GitHub repo, require nontrivial amounts of extra time, and researchers already have strained time budgets.
Maybe you could convince researchers to contribute with prizes that aligned with their university's goals. Just spitballing here, but maybe for each "top paper" award, get a team together to further clarify the ideas for a public audience, collaborate with the university and their department and some pop-science writers, and get some serious publicity beyond academic circles. If that doesn't convince a university administration that the work is worth the lower publication count, what will?
In the worst case it'll be the miserable graduate students' jobs to implement all these publication efforts, and they won't be able to spend time learning how to do research.
In the short term, Distill's editorial assistance will help authors produce outstanding papers, although they need to be willing to work as well.
In the longer-term, I'd like to explore match making between data visualization people who would like to get into machine learning and machine learning researchers publishing papers.
And in the very long term, I think the right solution is to add a new component to the research ecosystem. Just like we have people who specialize as research engineers, theoreticians, and experimentalists, I'd like to have a respected "research distiller" specialization. Eventually, I'd like to try to start special grants for research groups to have someone focused on this.
Two Minute Papers on YouTube:
I have actually been meaning to write a paper on my findings and have been looking for journals to write for. However it doesn't quite "fit" with most journals. Distill looks like it's more catered to "professional" machine learning people, at least for now. Is there any way that somebody with my background (design+data viz+development+interest and curiosity to learn ML) could be involved with Distill?
Absolutely. We know a number of leading ML researchers who would love to publish papers as Distill articles but don't have the design/data vis skills. We'd like to facilitate collaborations which would lead to data vis people co-authoring cutting edge research papers.
Or would said facilitation be done by the admins / editors / steering committee? If so, then how do you plan on finding dataviz people? I'm asking this in particular because I would imagine that people who have ML findings to talk about would probably contact you ("I researched such and such, and found such and such. Now I would love to publish in Distill"). But I wonder if data visualization specialists would do the same thing. Contacting you with "hey, I love data viz, would love to collaborate with somebody looking for one" feels a little inappropriate to me.
As a data viz person, I would be absolutely thrilled to work on this. I'm trying to scrape together time here and there to position myself better in that respect, learning more and trying to bridge that gap.
And I rarely cover recent work.
> In the longer-term, I'd like to explore match making
> between data visualization people who would like to
> get into machine learning and machine learning
> researchers publishing papers.
I'm a junior faculty working in ML with no personal knowledge of web development, d3, etc. While the papers currently on Distill are absolutely gorgeous and will be an invaluable tool for learning advanced ML concepts, I simply cannot see myself or my students putting the time to actually create something like that.
Unless a student is especially adept at the specific tools needed to create these and especially enthusiastic at using them, I will actively discourage them from doing it. The time needed is simply not worth it right now.
I would be happy and grateful if tools for creating these articles become easier to learn and use eventually, such that even the lower-budget, time-constrained researchers could afford to create them.
I won't deny that the time commitment needed for a Distill article is nontrivial - it is far more work than a technical blog post. But in terms of a pure tradeoff of time per publication, the calculus makes sense. Most of the work of research distillation and synthesis is already part of the research process, and writing a Distill article is just a matter of putting it all down on paper. Doing research is a far more time-consuming and less predictable process.
This applies especially if you write the distillation targeted at the lab you want to hire you.
Hey let's be honest, most academics (that I know) still don't even use LaTeX (or refuse to do so). This is really cool, but requires way too many skills (in js/css3/html5/distill-extensions and node.js).
Personally, my team and I had a really great experience with sharelatex.com, even though I was the only one of us who knew LaTeX. I liked that it's also open source with a permissive license. I would rather host that on sandstorm.io the next time, or just pay for the comfort offered by overleaf.com (I've never seen such a beautiful collaborative LaTeX editor).
• What about vendor lock-in?
• Can you export to LaTeX, Word or PDF?
• Can you selfhost it for your team or company?
What field? TeX is pretty much de rigueur in Math/CS/Physics graduate schools in the U.S.
I clearly don't expect people to do that much. I can only do that because I'm coming from web development, and very nice tools started to appear recently.
People in research need a design framework, like a set of templates for keynotes/PPT/JS/CSS (think about how much traction Bootstrap got). Distill is doing an awesome job of showing an example of what you could do.
Maybe Distill could open-source the templates they use to build those blog posts?
On the other hand, being able to write well and to create good interactive illustrations are valuable skills. Maybe we could incorporate these things into seminars or otherwise crowdsource the creation of e.g. individual figures?
So, I guess this will help Distill get traction.
Google Research: https://research.googleblog.com/2017/03/distill-supporting-c...
YC Research: http://blog.ycombinator.com/distill-an-interactive-visual-jo...
Chris Olah: http://colah.github.io/posts/2017-03-Distill/
* In the last three weeks, we've had 80 outreach conversations with various stakeholders for Distill. The majority of these have been academic researchers. The response has been extremely positive.
* A number of ML faculty at Stanford / Berkeley / Toronto / Montreal are very excited and supportive of Distill.
* Distill's steering committee consists of recognized leaders in ML and data visualization.
* We've registered with the Library of Congress / CrossRef, dotting our "i"s and crossing our "t"s to be a serious journal. In some senses, we're more legitimate than some notable venues.
* The largest industry research groups institutionally support Distill.
My sense is that the academic community really wants to have something like this, if it can be done well. At the end of the day, we need to publish outstanding content and demonstrate that we're a high-quality venue.
Yet I don't see how this will readily support possibly cutting-edge work or new research in machine learning that does not have access to visualization development, or these forged connections to Distill to facilitate the development of these visualizations.
So it seems like a likely outcome is that Distill publishes content from well-regarded institutions and increases publicity for that work, to the detriment of a vast bulk of papers which do not have access to the visualization resources to develop Distill-ed versions of their work.
Furthermore, and this is a larger disciplinary issue, but it seems inherently this could end up spotlighting more CS-y machine learning vs statistical learning due to cultural differences between disciplines and differences in computational/web development background in grad students and researchers in both fields. Are there efforts to reach out to statistical associations as well?
Anyway, as a result, I don't see a reason why an alternative-format journal would necessarily fare any worse than conferences have in terms of becoming accepted, if the reviewing standards are high and if it attracts citations.
For the hiring side (more than the tenure side), to some extent, oddly enough, the first-order decision here is in Google's hands. A lot of CS hiring committees nowadays unofficially do a first cut sifting of resumes by typing candidates into Google Scholar and looking at their Google-computed h-index, so what "counts" is basically up to Google.
With new things, what you need is at least one person on the committee to fight and convince the others why this new thing is awesome. As someone who is now on some of these committees, I would put all my weight behind something like this should I encounter it (assuming of course it has the relevant quality).
As a side note who made the interface design for this?:
I am very interested in getting into this space from a design perspective.
(follow the thread)
Thank you for this effort. I'm a fan of your blog articles. A question regarding Distill: is it a journal like a conventional journal, targeting new research? Or is it a journal for educational articles that explain older research better?
I hope to contribute to an effort to better explain deep learning. I don't know if that is what distill is looking for?
How do I donate to this?
As you'd imagine though, it's really hard. Reading a two page research paper is a very different experience from reading a NYTimes or WSJ article. The information density is enormous, the vocabulary is very domain specific, and it can take days or weeks of re-reading and looking up terms to finally understand a paper.
I'm really excited about Distill; there's a lot of value in making research papers more accessible and interesting. I've noticed that the ML/AI field has been pioneering in its research publication process: some papers are now published with source code on GitHub, with the authors answering questions on r/machinelearning. This seems like a really great next step, and I hope other fields of science will break away from traditional journals and do the same.
I worked in clinical research in a past life, and studies would be highly discounted if they couldn't be reproduced. A highly detailed methods section was key. Many ML papers I see tend to have incredibly formalized, LaTeX+Greek-obsessed methods sections that fall far short of anything that would allow reproduction. Some ML papers, I swear, must have run their parameter searches 1000 times to overfit and magically achieve 99% AUC.
Worse, I actually have tons of spare GPU farm capacity I'd love to devote to reproducing research, tweaking, trying it on adjacent datasets, etc. But the effort to reproduce is too high for most papers.
It is also disappointing to see various input datasets strewn about individuals' personal homepages, where the links sometimes end up broken. Sometimes the "original" dataset is in a pickled form after having already gone through multiple upstream transformations. I hope Distill can instill some good practices in the community.
It seems clear to me that the future will involve some kind of linking reproducibility to papers. If we want to find that future, we need a way for people to experiment with what a publication is.
* Dependencies like libraries are not specified anywhere.
* Dependencies on local code are not bundled.
* Dependencies on local data are not bundled.
* Underlying requirements like LLVM are not recorded (it needs to be specifically 3.9.X for llvmlite in Python, as I discovered recently).
* Perhaps most dangerously, you can run the code sections out of order, and deleted sections will leave their variables around which can interfere with the run. I've been caught out by this in my own notebooks.
I really like jupyter notebooks, but I think some of the design decisions (correct for some ways of working) actively work against reproducible reports.
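To make the out-of-order point concrete, here is a minimal sketch of a check for it. It relies only on the `execution_count` field that the notebook JSON format stores for each code cell. The function name `out_of_order_cells` and the hand-written example notebook are hypothetical, made up for illustration. Note that it cannot detect state left behind by deleted cells, which is the harder half of the problem.

```python
import json


def out_of_order_cells(notebook_json):
    """Return indices of code cells whose execution counter breaks top-to-bottom order."""
    nb = json.loads(notebook_json)
    problems = []
    last = 0
    for index, cell in enumerate(nb.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no execution counter
        count = cell.get("execution_count")
        if count is None:
            continue  # cell was never executed, nothing to compare
        if count <= last:
            problems.append(index)  # ran earlier than a cell above it
        else:
            last = count
    return problems


# Hypothetical notebook: the last cell was executed before the first one.
example = json.dumps({
    "nbformat": 4,
    "cells": [
        {"cell_type": "code", "execution_count": 2, "source": "x = 1"},
        {"cell_type": "markdown", "source": "notes"},
        {"cell_type": "code", "execution_count": 1, "source": "print(x)"},
    ],
})
print(out_of_order_cells(example))  # [2]
```

A check like this only flags a symptom; re-running the whole notebook from a clean kernel is still the more reliable test.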
There was a recent writeup here:
> we were able to successfully execute only one of the ~25 notebooks that we downloaded.
> Technologies such as Jupyter and Docker present great opportunities to make digital research more reproducible, and authors who adopt them should be applauded.
I think they solve a different use case well, and forcing them into a workflow they weren't designed for may just result in both less useful workbooks and a poor experience.
Edit - To expand a little, jupyter notebooks are nice to mix code and descriptions, and in essence force people to release a certain amount of their code. But other than that they actually provide fewer of the guarantees that you want from things for reproducibility. And since the goals for reproducibility generally force more restrictions on how you work, I can see there being more issues for trying to match these different ways of working.
I don't see how there are any features which are useful for the goal of making things reproducible, and as such why people keep bringing them up as a solution.
The main steps would seem to be
1. Make sure the results used are not generated on "my machine" but on a specified base run somewhere else. Just like we don't take the unit test results I run locally as gospel.
2. Unique and versioned identifiers for code, base system and data.
3. Archived code and data.
4. An agreed-upon format in the output data to say where it came from (which references the identifier(s) for the code, base system used and input data).
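As a rough sketch of steps 2 and 4 only, a provenance record could be built by hashing the code and input data and noting the base system alongside them. Everything here is an assumption for illustration: the function `build_manifest`, the field names, and the choice of SHA-256 are not an agreed format, and steps 1 and 3 (running on a specified base elsewhere, and archiving) are not covered.

```python
import hashlib
import json
import platform
from importlib import metadata


def sha256_bytes(data):
    """Content-derived identifier: identical bytes always give the same ID."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(code, input_data, packages):
    """Describe where an output came from (hypothetical field names)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "code_sha256": sha256_bytes(code),
        "data_sha256": sha256_bytes(input_data),
        "base_system": {
            "python": platform.python_version(),
            "platform": platform.platform(),
            "packages": versions,
        },
    }


# Toy inputs standing in for a real script and dataset.
manifest = build_manifest(b"print('hello')\n", b"a,b\n1,2\n", ["pip"])
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Shipping a record like this next to every output file means a reader can at least tell which code, data and environment a given result claims to come from.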
Your output might be a rendered notebook, but the notebook itself is entirely orthogonal to the process, as what a notebook provides is:
* A nice interface for entering the code
* A nice output format
* A neat way of mixing nicely written documentation along with the code
Will the papers published on Distill maintain transparency of the statistical process?
I see in the submission notes that articles are required to be a public GitHub repo, which is a positive indicator. Although the actual code itself does not seem to be a requirement.
Visualizations or algorithms described using code are much, much easier for me to understand and serve as a great starting point for unpacking the math explanations.
I would recommend picking up a book on computer science or algorithms; even just a cursory reading helps a lot. CS is very much not just programming, and describing it only through code is heavily restrictive.
Of course that's not completely permanent, but would perhaps provide some more safety.
Should you plug that in to IFTTT, Zapier, or something to that extent, you hopefully then have a weekly feed.
Though I do agree, an option to signup to updates directly on the website would be much better ;)
Also, it is very likely that veterans in the field will think of this format as too verbose and too sugar-coated, more appropriate for less math-savvy users and therefore not mainstream. Furthermore, I really feel TeX is irreplaceable unless you have all of its features covered. All of the historic efforts to replace TeX in research - even with the bells and whistles of WYSIWYG editors - have failed, and it's important to learn from those failures. You would be surprised how many researchers insist on printing out a paper to read it even when they have access to tablets and PCs.
Instead of being another peer reviewed journal, Distill could act as the following:
- platform to publish supplemental material and code
- platform to manage communication/issues post publication
- platform for readers to invite other readers for peer review and generate "front page" based on some sort of reviewer trust relationship.
- platform to host Python and MATLAB code with web frontends without researchers having to learn new developer skills
- support PDF submissions but without all the elitism of arXiv, using algorithms to create the "front page" based on some sort of peer reviewer rankings.
The above features are indeed sorely missing, and Distill has a good opportunity to become an "add-on" to current academic publishing systems as opposed to another peer reviewed journal.
And what about long-term stability/presence. Most top journals and their publishing houses (NPG, Elsevier, Springer) are likely to hang around for another decade (or two...), while I don't feel so sure about that for a product like GitHub. Maybe Distill is/will be officially backed (financially) by the industry names supporting it?
That being said, I'd love seeing this succeed, but there seems much to be done to get this really "off the ground" beyond being a (much?!) nicer GitXiv.
If you just apply the formulas anyways, you'll get a JIF of (6 citations)/(4 publications) = 1.5. Again, this number is really pessimistic because those publications are only a few months old and haven't had time to accumulate citations.
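For reference, the arithmetic above is just a ratio. The usual two-year JIF divides the citations received in a given year to items from the two preceding years by the number of citable items published in those two years. A minimal sketch, with a made-up function name:

```python
def impact_factor(citations, publications):
    """Citations to recent items divided by the number of citable items."""
    if publications <= 0:
        raise ValueError("need at least one citable publication")
    return citations / publications


print(impact_factor(6, 4))  # 1.5
```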
> And what about long-term stability/presence.
We aren't particularly tied to GitHub besides it being convenient. Even if the journal died, keeping it up indefinitely would be very cheap.
More than that, we're looking into joining projects like LOCKSS to ensure preservation of the academic record.
> but there seems much to be done to get this really "off the ground" beyond being a (much?!) nicer GitXiv.
We've actually done a lot of the logistics needed to legitimize a journal. We've registered as a journal with the Library of Congress, joined CrossRef, and built infrastructure to integrate our metadata with the library system.
Of course, there's a lot more to do. But the biggest thing is to just publish great content and run Distill as a serious, high-quality venue.
Re. IF, sorry if my first post wasn't as obvious as I thought it would be. I wasn't referring to how IF is calculated, much less to Distill's current IF. Rather, there are two big problems related to IF that Distill needs to "solve": not the how, but rather the when and who of IF:
Ad when: The chicken-and-egg problem. As colah3 wrote, Distill's IF will only become meaningful in two years. But if you have exciting research, you want that to be in a high-impact journal/venue now. So attracting good research as a new journal/venue is extremely difficult, and probably the one main reason why new journals fail (cf. the number of new journals/venues and the mostly non-existent change in impact rankings of the "best" places to publish). However, if you can get private researchers in industry to publish in Distill, because they are not [so] "dependent" on IF, you might accumulate sufficient impact in the first two years to get to a nice score that later makes Distill competitive with the various IEEE journals or JMLR.
Ad who: The even worse problem is that (at least European, not sure about US) universities evaluate their researchers by looking up their Web of Science ranking/score. WoS in turn is controlled by Thomson Reuters (TR), who also decide which journals get ranked in WoS (and sell access to WoS to universities and governments - n/c...). If a journal is not "recognized" by WoS, the publication or its citations do not get counted by TR. Ergo, as a public researcher, your funding dries up and/or you don't get the promotions you need. For that reason alone, no researcher in public research will allow her/his students and postdocs to publish in a journal that is not indexed by TR/WoS. But again, you might get around that by behaving "like" arXiv at first, at least: Most journals now grudgingly accept that the work was first on arXiv before it got published in some high-impact journal or venue. And maybe there is even a chance that the publishing industry will have to accept Distill in their midst (i.e., index it in WoS) if some other industrial backers create enough pressure...
As might be clear from the above, I (and many researchers) am (are) fed up with the current publishing system, so I certainly hope a "self-hosted", free solution controlled by the public [researchers] will one day break the iron fist the current (private) publishing houses exert over how research is managed and evaluated today. If Distill manages to keep itself independent from industry, but at the same time can use the political weight its current backing could bring, maybe this is a way to break this vicious cycle?
Because the blog post and title seem to be describing it as a "journal" intended to replace PDF publications, but the actual content appears to be more in the tutorial/survey category, e.g. "how to use t-SNE," etc. Is this intended to be a place to publish new research in the future, or is it meant more for enhanced "medium"-style blog posts?
Both are fine, I just find the dissonance between the announcement and the actual content a bit confusing.
I like the idea of having research being approachable for the non-scientist, and the more important question of whether there is a more efficient form (in terms of communicating new science between scientists) for research papers to take.
Is there any relevant work along this vector of thought that I should check out? Because I would really love to do some work on this.
I made an awesome list recently just for this topic: github.com/sp4ke/awesome-explorables
I am a UI-developer who has been wanting to learn ML forever.
I started working on
2. think bayes
3. UW data science @ scale w/ coursera
4. udacity car nano degree
I'm going to write some articles about what I learn and hopefully move into the ML field as a data engineer in 6 months. I figure I got into my current job with a visual portfolio of nicely designed css/js demos, maybe the same thing will work for AI.
(One of the members of our steering committee, Michael Nielsen, has a significant history advocating for open science. I think there's about a snowball's chance in hell he'd be involved if we weren't. :P )
> Diagrams and text are licensed under Creative Commons Attribution CC-BY 2.0, unless noted otherwise, with the source available on GitHub. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.
Ideally code and data would be unambiguously public domain (CC0-1.0) or under appropriate open source and open data licenses.
this is tres bien.
same for data sets?
If you use Wikipedia as an input, for example, your data is CC-By-SA, not CC-By.
"Distill articles must be released under the Creative Commons Attribution license."
With a little more flexibility to keep things private before publishing: "You can keep it private during the review process if you would like"
I agree that the DOI should be included in the BibTeX citation.
It's the only way I could get a working model of Caffe while understanding the data preparation steps. I've already retrofitted it to classify tumors.
Other than that, this looks much, much easier to write than LaTeX.
Having more research transparency is great for a community of like minds to learn from. A suggested addition is a section and team to lead a discussion on ML ethics.
What a time to be alive!