
Distill: a modern machine learning journal - jasikpark
http://distill.pub/about/
======
j2kun
I sure hope this catches on, but we should all be aware of the hurdles:

\- Little incentive for researchers to do this beyond their own good will.

\- Most ML researchers are bad writers, and it's unlikely that the editing
team will do the work needed (which is often a larger reorganization of a
paper and its ideas) to improve clarity.

\- Producing great writing and clear, interactive figures, and managing an
ongoing GitHub repo require nontrivial amounts of extra time, and researchers
already have strained time budgets.

\- It requires you to learn git, front-end web design, and assorted JavaScript
libraries (I for one think d3 is a nuisance), exacerbating the time spent on
tangents to research.

Maybe you could convince researchers to contribute with prizes that align
with their university's goals. Just spitballing here, but maybe for each "top
paper" award, get a team together to further clarify the ideas for a public
audience, collaborate with the university, the department, and some pop-
science writers, and get some serious publicity beyond academic circles. If
that doesn't convince a university administration that the work is worth the
lower publication count, what will?

In the worst case it'll be the miserable graduate students' jobs to implement
all these publication efforts, and they won't be able to spend time learning
how to do research.

~~~
colah3
You're absolutely right that this is a lot of work, and not many ML
researchers have all the skills needed for it.

In the short term, Distill's editorial assistance will help authors produce
outstanding papers, although authors need to be willing to put in the work as
well.

In the longer term, I'd like to explore matchmaking between data
visualization people who would like to get into machine learning and machine
learning researchers publishing papers.

And in the very long term, I think the right solution is to add a new
component to the research ecosystem. Just like we have people who
specialize as research engineers, theoreticians, and experimentalists, I'd
like to have a respected "research distiller" specialization. Eventually, I'd
like to try to start special grants for research groups to have someone
focused on this.

~~~
kowdermeister
I already know a guy who's doing this. Although he chose to publish very short
videos on various research (including much AI/ML work), the concept and goal
are more or less the same.

Two Minute Papers on YouTube:

[https://www.youtube.com/user/keeroyz/videos](https://www.youtube.com/user/keeroyz/videos)

~~~
colah3
Karoly does lovely work! :)

------
colah3
Various announcements:

Google Research: [https://research.googleblog.com/2017/03/distill-supporting-clarity-in-machine.html](https://research.googleblog.com/2017/03/distill-supporting-clarity-in-machine.html)

DeepMind: [https://deepmind.com/blog/distill-communicating-science-machine-learning/](https://deepmind.com/blog/distill-communicating-science-machine-learning/)

OpenAI: [https://openai.com/blog/Distill/](https://openai.com/blog/Distill/)

YC Research: [http://blog.ycombinator.com/distill-an-interactive-visual-journal-for-machine-learning-research/](http://blog.ycombinator.com/distill-an-interactive-visual-journal-for-machine-learning-research/)

Chris Olah:
[http://colah.github.io/posts/2017-03-Distill/](http://colah.github.io/posts/2017-03-Distill/)

~~~
curuinor
As I said in Rob's thingy, I hope you get the tenure committees and job
committees on board; they don't have to respect it, but they're the ones whose
respect you have to win.

~~~
colah3
All we can do is work hard to build academic support:

* In the last three weeks, we've had 80 outreach conversations with various stakeholders for Distill. The majority of these have been academic researchers. The response has been extremely positive.

* A number of ML faculty at Stanford / Berkeley / Toronto / Montreal are very excited and supportive of Distill.

* Distill's steering committee consists of recognized leaders in ML and data visualization.

* We've registered with the Library of Congress / CrossRef, dotting our "i"s and crossing our "t"s to be a serious journal. In some senses, we're more legitimate than some notable venues.

* The largest industry research groups institutionally support Distill.

My sense is that the academic community really wants to have something like
this, if it can be done well. At the end of the day, we need to publish
outstanding content and demonstrate that we're a high-quality venue.

~~~
arjunnarayan
Can you share a "behind the scenes" of what it took to get Distill off the
ground? You hint at dotting your "i"s and crossing your "t"s, but an explicit
manual would be useful. Communities other than machine learning could benefit
from something like this, and if Distill succeeds in being taken seriously by
your research community, it would help to have a playbook with which to
replicate that success in other research communities as well.

------
choxi
I've been trying to read more primary-source information, sort of as my own
way of combatting "fake news", even before that term was coined. There's a
learning curve to it, but I've found that reading S-1 filings and quarterly
earnings reports can be more enlightening than reading a news article on any
given company. Likewise, reading research papers on biology and deep learning
is _significantly_ more valuable than reading articles or educational content
on those topics.

As you'd imagine, though, it's really hard. Reading a two-page research paper
is a very different experience from reading a NYTimes or WSJ article. The
information density is enormous, the vocabulary is very domain-specific, and
it can take days or weeks of re-reading and looking up terms to finally
understand a paper.

I'm really excited about Distill; there's a lot of value in making research
papers more accessible and interesting. I've noticed that the ML/AI field has
been a pioneer in the research publication process: some papers are now
published with source code on GitHub and with the authors answering questions
on r/machinelearning. This seems like a really great next step, and I hope
other fields of science will break away from traditional journals and do the
same.

------
TuringNYC
I don't want to undermine visualizations, they are awesome, but one of the big
problems I see with ML research is the lack of reproducibility. I know that
Google, Facebook, and some others already share associated source repos, but
it should almost be mandatory when working with public benchmark datasets.
Source + Docker images would be even better.

I worked in clinical research in a past life, and studies would be heavily
discounted if they couldn't be reproduced. A highly detailed methods section
was key. Many ML papers I see tend to have an incredibly formalized, LaTeX-
and-Greek-obsessed methods section, but fall far short of providing anything
that allows reproduction. Some ML papers, _I swear_, must have run their
parameter searches a thousand times to overfit and magically achieve 99% AUC.

Worse, I actually have tons of spare GPU farm capacity I'd love to devote to
reproducing research, tweaking it, trying it on adjacent datasets, etc. But
the effort to reproduce is too high for most papers.

It is also disappointing to see various input datasets strewn about
individuals' personal homepages, where they sometimes end up as broken links.
Sometimes the "original" dataset is in a pickled form after having already
gone through multiple upstream transformations. I hope Distill can _instill_
some good best practices in the community.

~~~
colah3
I think that having a venue that can publish non-traditional academic
artifacts is an important step for reproducibility, even if it isn't our
focus.

It seems clear to me that the future will involve some way of linking
reproducibility to papers. If we want to reach that future, we need a way for
people to experiment with what a publication is.

~~~
bpicolo
Jupyter notebooks are a big piece of solving ML reproducibility, it feels
like.

~~~
IanCal
I see this a lot, but I disagree, at least in their current form. They miss a
variety of very key parts for reproducibility (which, to be fair, was not
their original goal).

* Dependencies like libraries are not specified anywhere.

* Dependencies on local code are not bundled.

* Dependencies on local _data_ are not bundled.

* Underlying system requirements, like LLVM (which needs to be specifically 3.9.x for llvmlite in Python, as I discovered recently), are not captured.

* Perhaps most dangerously, you can run the code sections out of order, and deleted sections will leave their variables around which can interfere with the run. I've been caught out by this in my own notebooks.

I really like Jupyter notebooks, but I think some of the design decisions
(correct for some ways of working) actively work against reproducible reports.
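The first gap above, unspecified library dependencies, can at least be mitigated by snapshotting the environment next to the results. A minimal sketch using only the Python standard library; the manifest layout here is an illustrative assumption, not an existing standard:

```python
# Sketch: record the interpreter and installed-package versions alongside a
# notebook's results, so a reader at least knows which dependencies produced
# them. Standard library only (Python 3.8+); the manifest layout is made up
# for illustration.
import json
import platform
import sys
from importlib import metadata

def environment_manifest():
    """Return a dict describing the interpreter and installed packages."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": sorted(
            f"{dist.metadata['Name']}=={dist.version}"
            for dist in metadata.distributions()
        ),
    }

if __name__ == "__main__":
    # Save this next to the results so the run can be re-created later.
    print(json.dumps(environment_manifest(), indent=2))
```

This doesn't bundle local code or data, but it would have caught the llvmlite/LLVM version pinning issue mentioned above.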

There was a recent writeup here:

> we were able to successfully execute only one of the ~25 notebooks that we
> downloaded.

[https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html](https://markwoodbridge.com/2017/03/05/jupyter-reproducible-science.html)

~~~
bpicolo
Right, "a part" was important. Looks like the authors of that writeup agree.

> Technologies such as Jupyter and Docker present great opportunities to make
> digital research more reproducible, and authors who adopt them should be
> applauded.

~~~
IanCal
I somewhat disagree that it's a big part, or even that it really should be _a_
part, of the solution; I'm really not sure that these notebooks are the right
approach to making research reproducible. The conclusion there doesn't seem
supported by their findings, to me.

I think they solve a different use case well, and forcing them into a workflow
they weren't designed for may just result in both less useful notebooks and a
poor experience.

Edit - To expand a little: Jupyter notebooks are nice for mixing code and
descriptions, and in essence _force_ people to release a certain amount of
their code. But other than that, they actually provide few of the guarantees
that you want for reproducibility. And since the goals of reproducibility
generally force more restrictions on how you work, I can see there being more
friction in trying to match these different ways of working.

I don't see any features that are useful for the goal of making things
reproducible, and so I don't see why people keep bringing them up as a
solution.

The main steps would seem to be

1\. Make sure the results used are not generated on "my machine" but on a
specified base run somewhere else. Just like we don't take the unit test
results I run locally as gospel.

2\. Unique and versioned identifiers for code, base system and data.

3\. Archived code and data.

4\. An agreed on format in the output data to say where it came from (which
references the identifier(s) for the code, base system used and input data)

Your output might be a rendered notebook, but the notebook itself is entirely
orthogonal to the process, as what a notebook provides is:

* A nice interface for entering the code

* A nice output format

* A neat way of mixing nicely written documentation along with the code
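A rough sketch of what steps 2-4 might look like in practice. The hashing scheme and record format are illustrative assumptions (content hashes standing in for "unique and versioned identifiers"), and the file contents are hypothetical placeholders:

```python
# Sketch: a provenance record for an output artifact, per the steps above.
# Content hashes serve as unique identifiers for the exact code and input
# data; the record format and inputs below are hypothetical placeholders.
import hashlib
import json

def sha256_of(data: bytes) -> str:
    """Content hash used as a unique, versioned identifier."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(code: bytes, input_data: bytes, base_system: str) -> dict:
    """Say where an output came from: code, input data, and base system."""
    return {
        "code_sha256": sha256_of(code),
        "input_sha256": sha256_of(input_data),
        "base_system": base_system,  # e.g. a Docker image digest
    }

record = provenance_record(b"print('train')", b"col1,col2\n1,2\n", "ubuntu:16.04")
print(json.dumps(record, indent=2))
```

Embedding such a record in the output data is what lets a later reader verify they are re-running exactly what produced the published result.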

------
minimaxir
The announcements and About page indicate an emphasis on visuals and
presentation, which I appreciate. But when I think of "modern machine
learning," I think of open source and reproducibility (e.g., Jupyter
notebooks).

Will the papers published on Distill maintain transparency of the statistical
process?

I see in the submission notes that articles are required to be in a public
GitHub repo, which is a positive indicator, although the actual code itself
does not seem to be a requirement.

~~~
shancarter
I totally agree that this is very important. While it isn't currently our
primary focus, having a publishing platform that can accommodate a variety of
content types (including code and data) feels like a step in the right
direction.

------
Xeoncross
As a developer with a weaker background in mathematics, I face a language
barrier with many modern algorithms. After lots of research I can understand
and explain them in code, but I have no idea what your artistic-looking
MathML means.

Visualizations or algorithms described using code are much, much easier for me
to understand and serve as a great starting point for unpacking the math
explanations.

~~~
runemopar
I understand where you're coming from and you raise a valid point, but the
ML/AI field is heavily academic and oriented around research. The target
audience is people with a very strong math background and the necessary
context.

I would recommend picking up a book on computer science or algorithms; even a
cursory reading helps a lot. CS is very much not just programming, and it is
heavily constrained when described only through code.

------
blinry
Shameless self-plug: If you like interactive explanations, check out
[http://explorableexplanations.com/](http://explorableexplanations.com/) and
the explorables subreddit:
[https://www.reddit.com/r/explorables/](https://www.reddit.com/r/explorables/)

------
cing
Is there any concern about a web-native journal being less "future-proof"?
I've come across quite a few interactive learning demonstrations in Flash/Java
that no longer work.

~~~
shancarter
This is a high priority for us. By focusing on web standards and avoiding
proprietary plugins, we're pretty confident that the content will be future-
proof.

~~~
IanCal
Something that could help is specifying that examples should work in, e.g., a
particular recent Firefox version on Ubuntu, then providing a VM and an
archived version of Firefox. Put it on a platform that archives things with
CLOCKSS/LOCKSS and get a DOI; then, although you're not expecting people to
use it on a daily basis, it'd cover several "worst case" kinds of scenarios.

Of course that's not completely permanent, but it would perhaps provide some
more safety.

------
dang
YC Research's (and longtime HNer!) michael_nielsen wrote an announcement here:
[http://blog.ycombinator.com/distill-an-interactive-visual-journal-for-machine-learning-research/](http://blog.ycombinator.com/distill-an-interactive-visual-journal-for-machine-learning-research/).
Hopefully he'll participate in the discussion too.

------
rememberlenny
I wish there was a way to subscribe to a weekly email related to this.

~~~
blackRust
There does seem to be an RSS feed:
[http://distill.pub/rss.xml](http://distill.pub/rss.xml), although it is not
advertised on the website (I had to view-source to find it).

Should you plug that into IFTTT, Zapier, or something to that effect, you can
hopefully get a weekly feed.

Though I do agree, an option to sign up for updates directly on the website
would be much better ;)
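For anyone wanting to script this rather than go through IFTTT, pulling titles out of an RSS 2.0 feed takes only the Python standard library. The sample XML below is a stand-in for the real feed at distill.pub/rss.xml, not its actual contents:

```python
# Sketch: extract item titles from an RSS 2.0 feed, as a first step toward
# a self-made weekly digest. The XML below is a stand-in for what
# http://distill.pub/rss.xml might return, not its real contents.
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Distill</title>
    <item><title>Attention and Augmented Recurrent Neural Networks</title></item>
    <item><title>How to Use t-SNE Effectively</title></item>
  </channel>
</rss>"""

def item_titles(rss_xml: str) -> list:
    """Return the title of every <item> in an RSS 2.0 document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]

print(item_titles(SAMPLE_RSS))
# → ['Attention and Augmented Recurrent Neural Networks', 'How to Use t-SNE Effectively']
```

A cron job that diffs this list week to week would approximate the missing email subscription.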

------
sytelus
This is great, but it would have been even better if Distill were designed to
play well with the current system. The vast majority of researchers are
focused on publishing at various conferences with strict deadlines. Even if
they had all the skill sets and time to produce these beautiful illustrations,
I highly doubt this will change.

Also, it is very likely that veterans in the field will think of this format
as too verbose and too sugar-coated, more appropriate for less math-savvy
readers and therefore not mainstream. Furthermore, I really feel TeX is
irreplaceable unless you have all of its features covered. Every historic
effort to replace TeX in research, even with the bells and whistles of WYSIWYG
editors, has failed, and it's important to learn from those failures. You
would be surprised how many researchers insist on printing out a paper for
reading even when they have access to tablets and PCs.

Instead of being another peer-reviewed journal, Distill could act as the
following:

\- a platform to publish supplemental material and code

\- a platform to manage communication/issues post-publication

\- a platform for readers to invite other readers for peer review and to
generate a "front page" based on some sort of reviewer trust relationship

\- a platform to host Python and MATLAB code with web frontends, without
researchers having to learn new developer skills

\- support for PDF submissions, but without all the eliteness of arXiv, using
algorithms to create the "front page" based on some sort of peer reviewer
rankings

The above features are sorely missing, and Distill has a good opportunity to
become an "add-on" to current academic publishing systems as opposed to
another peer-reviewed journal.

------
transcranial
This is really exciting! Chris et al.: have you guys seen Keras.js
([https://github.com/transcranial/keras-js](https://github.com/transcranial/keras-js))?
It could probably be useful for certain interactive visualizations or papers.

------
fnl
How does this provide IF ratings? Probably irrelevant for industry, but
publishing in academia is all about IF, no matter how bad and corrupt one
might think that system is.

And what about long-term stability/presence? Most top journals and their
publishing houses (NPG, Elsevier, Springer) are likely to hang around for
another decade (or two...), while I don't feel so sure about that for a
product like GitHub. Maybe Distill is/will be officially backed (financially)
by the industry names supporting it?

That being said, I'd love seeing this succeed, but there seems much to be done
to get this really "off the ground" beyond being a (much?!) nicer GitXiv.

~~~
colah3
Our present JIF is undefined because we haven't existed for two years yet.

If you apply the formulas anyway, you'll get a JIF of (6 citations)/(4
publications) = 1.5. Again, this number is really pessimistic because those
publications are only a few months old and haven't had time to accumulate
citations.

> And what about long-term stability/presence.

We aren't particularly tied to github besides it being convenient. Even if the
journal died, keeping it up indefinitely would be very cheap.

More than that, we're looking into joining projects like LOCKSS to ensure
preservation of the academic record.

> but there seems much to be done to get this really "off the ground" beyond
> being a (much?!) nicer GitXiv.

We've actually done a lot of the logistics needed to legitimize a journal.
We've registered as a journal with the Library of Congress, joined CrossRef,
and built infrastructure to integrate our metadata with the library system.

Of course, there's a lot more to do. But the biggest thing is to just publish
great content and run Distill as a serious, high-quality venue.

~~~
fnl
I for one am not so convinced GitHub is likely to be around for another decade
or two. But whatever, let's just pretend that Distill can always find a _free_
hosting solution; that is not so unlikely. Maybe that's good enough?

Re. IF, sorry if my first post wasn't as obvious as I thought it would be.
I wasn't referring to how IF is calculated, much less to Distill's current IF.
Rather, there are two big problems related to IF that Distill needs to
"solve"; not the _how_, but rather the _when_ and _who_ of IF:

Ad _when_: the chicken-and-egg problem. As colah3 wrote, Distill's IF will
only become meaningful in two years. But if you have exciting research, you
want it to be in a high-impact journal/venue _now_. So attracting good
research as a new journal/venue is extremely difficult, and probably the one
main reason why new journals fail (cf. the number of new journals/venues and
the mostly nonexistent change in impact rankings of the "best" places to
publish). However, if you can get private researchers in industry to publish
in Distill, because they are not [so] "dependent" on IF, you might accumulate
sufficient impact in the first two years to get to a nice score that later
makes Distill competitive with the various IEEE journals or JMLR.

Ad _who_ : The even worse problem that (at least European, not sure about US)
universities evaluate their researchers by looking up their Web of Science
ranking/score. WoS in turn is controlled by Thomson Reuters (TR), who also
decide _which_ journals get ranked in WoS (and sell access to WoS to
universities and governments - n/c...). If a journal is not "recognized" by
WoS, the publication or its citations do not get counted by TR. Ergo, as a
public researcher, your funding dries up and/or you don't get the promotions
you need. For that reason alone, no researcher in public research will allow
her/his students and postdocs to publish in a journal that is not indexed by
TR/WoS. But again, you might get around that by behaving "like" arXiv at
first, at least: Most journals now grudgingly accept that the work was first
on arXiv before it got published in some high-impact journal or venue. And
maybe there is even a chance that the publishing industry will have to accept
Distill in their midst (i.e., index it in WoS) if some other industrial
backers create enough pressure...

As might be clear from the above, I (and many researchers) am (are) fed up
with the current publishing system, so I certainly hope a "self-hosted", free
solution controlled by the public [researchers] will one day break the iron
fist the current (private) publishing houses hold over how research is
managed and evaluated today. If Distill manages to keep itself independent
from industry, but at the same time can use the political weight its current
backing could bring, maybe this is a way to break this vicious cycle?

------
radarsat1
While this is very nice, I'm a bit confused about the target. What kind of
material is intended to be published here in the future?

Because the blog post and title seem to be describing it as a "journal"
intended to replace PDF publications, but the actual content appears to be
more in the tutorial/survey category, e.g. "how to use t-SNE," etc. Is this
intended to be a place to publish _new_ research in the future, or is it meant
more for enhanced "Medium"-style blog posts?

Both are fine, I just find the dissonance between the announcement and the
actual content a bit confusing.

------
chairmanwow
I feel like science publication in general could benefit from disruption of
the publishing model. I'm not sure that the toolkit Distill has provided is
quite enough to totally change the paradigm, and it is currently restricted
to only one field.

I like the idea of making research approachable for the non-scientist, and
the more important question of whether there is a more efficient form (in
terms of communicating new science between scientists) for research papers to
take.

Is there any relevant work along this vector of thought that I should check
out? Because I would really love to do some work on this.

~~~
sp4ke
Yes, check out everything made by Bret Victor and his explorable
explanations.

I made an awesome list recently just for this topic:
github.com/sp4ke/awesome-explorables

------
ycHammer
Would saving Jupyter notebooks as .html work? PS: I have published in all of
the top-4-tier ML conferences but s__k at HTML/CSS/JS. What is my pathway to
Distill now? I, like every other researcher worth her/his salt, am always
racing the clock when it comes to deadlines and literature to review. So,
yeah? Coaxing myself into investing time in CSS/HTML/JS in lieu of picking up
more math tools seems criminal to me. Am I alone in this?

------
mysore
Wow, this comes at a great time!

I am a UI developer who has been wanting to learn ML forever. I started
working on:

1\. fast.ai

2\. Think Bayes

3\. UW Data Science at Scale on Coursera

4\. Udacity's self-driving car nanodegree

I'm going to write some articles about what I learn and hopefully move into
the ML field as a data engineer in 6 months. I figure I got into my current
job with a visual portfolio of nicely designed CSS/JS demos; maybe the same
thing will work for AI.

------
Old_Thrashbarg
I don't see it written explicitly; can anyone confirm that this journal is
fully open-access?

~~~
colah3
Yes. Everything is published under Creative Commons Attribution.

(One of the members of our steering committee, Michael Nielsen, has a
significant history advocating for open science. I think there's about a
snowball's chance in hell he'd be involved if we weren't. :P )

~~~
auvrw
> Everything is published under Creative Commons Attribution.

this is tres bien.

same for data sets?

~~~
rspeer
That would preclude most research data.

If you use Wikipedia as an input, for example, your data is CC-By-SA, not CC-
By.

------
JorgeGT
You should definitely assign a DOI to each article.

~~~
allenz
Distill does assign DOIs. There is a citation_doi meta tag in the page source,
and you can also find a complete list here:
[https://search.crossref.org/?q=Distill&publication=Distill](https://search.crossref.org/?q=Distill&publication=Distill)

I agree that the DOI should be included in the BibTeX citation.

~~~
JorgeGT
I see! Yes, this is something I miss a lot on Google Scholar (I have to go to
the article page to search for the DOI field). It would be nice to also
display the DOI link somewhere near the author list since it seems standard
practice, but in the citation section would be good as well.

------
EternalData
Looks very good (especially the team behind it!), but I wonder if there's a
discrete step remaining between this and making machine learning materials
accessible to the general public, beyond data visualizations and clear
writing. This will certainly be a more interactive experience, but it seems to
cater to those who are "in the know" and just want a bit more
interactivity/clarity. It'd be nice to discuss the format changes, or the
"TLDR" bot of machine learning, that would make machine learning research
truly accessible to the general public.

------
fwx
This is amazing! My burning question: as has been pointed out in the thread,
producing a great article on Distill (generating interactive figures, doing
front-end web dev, etc.) would require a lot of time and resources on the part
of the researchers. Would it be possible to include within Distill an option
to connect researchers with willing-and-able developers in those domains (for
example, me) to help them get it done?

------
aabajian
I already have a nomination: the guy who wrote this blog post:

[http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/](http://adilmoujahid.com/posts/2016/06/introduction-deep-learning-python-caffe/)

It's the only way I could get a working model of Caffe while understanding the
data preparation steps. I've already retrofitted it to classify tumors.

------
taliesinb
Great stuff! I'm a fan of what's gone up on Distill so far. A question for
colah and co. if they're still around: when does the first issue of the
journal come out? (Edit: looks like individual articles just get published
when they get published, n/m.) Also, that "before/after" visualization of the
gradient descent convergence is intriguing; where's it from?

~~~
gabrielgoh
Find out in a week!

------
blunte
I don't know jack about machine learning, but these illustrations are gorgeous
- simple, elegant, and aesthetically very pleasing.

------
wodenokoto
Looking at the how-to section[1] for creating Distill articles, I fail to
find instructions on how to write math, or any notes on how best to reference
sections of the document.

Other than that, this looks much, much easier to write than LaTeX.

[1] [http://distill.pub/guide/](http://distill.pub/guide/)

------
djabatt
It would be cool to see greater diversity of thinking on the about page.
perhaps the pub is designed for insiders.

Having more research transparency is great for community of likes minds to
learn from. A suggested addition is an section and team to lead a discussion
ML ethics.

------
good_vibes
I will definitely submit my first paper to Distill. It draws upon a few
different fields but the foundation is definitely machine learning.

What a time to be alive!

------
mastazi
r/MachineLearning discussion:

[https://www.reddit.com/r/MachineLearning/comments/60hy0t/the...](https://www.reddit.com/r/MachineLearning/comments/60hy0t/the_journal_distill_launches_today_in_a_nutshell/)

------
ycHammer
Does anyone here have any idea whether Jupyter notebook -> save as .html would
do the trick?

------
skynode
Hopefully this won't be another ResearchGate dressed in open source clothing.

