Hacker News
You Can’t Do Data Science in a GUI (dominodatalab.com)
259 points by gk1 on Apr 15, 2018 | 169 comments

Sounds like the same argument people have against Wix and Squarespace - "you can't make a website in a GUI".

Yes, you can - but you'll be pretty limited. If you're a brick and mortar or service focused business, a website builder is great. If you rely deeply on a customized web experience, you need to do something custom.

Same with data science. You can get pretty far with some simple data analysis tool. If you need to go farther, then you need to build custom solutions.

I thought the argument that a command line gives you reproducibility in a way that a GUI doesn't was good.

Most of the things someone does in Photoshop don't have to be redone repeatedly. For system administration or, I'd guess, data science, a lot of things need to be redone regularly. Using the command line is good both for doing that and for getting into the mindset of doing that.

It needn't be this way: we /could/ have a well defined set of actions encompassed by an API and the GUI just allows one to tinker with the API, and we could have a detailed audit history of every API call that was made, and with which parameters, regardless of whether it originated from the GUI or from a script. This also means scripting support, of course. Finally you should be able to re-play a portion of history. This is approximately how photoshop works, as far as I remember. Perhaps it's slightly worse.
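A rough sketch of the idea in the comment above, in Python: every action goes through one registry, each call is logged with its parameters, and a slice of the log can be replayed. The `ActionLog` class and the `crop` and `scale` actions are hypothetical names made up for illustration; this is not how Photoshop or any real tool implements its history.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass
class ActionLog:
    """Registry of named actions plus an audit trail of every call."""

    actions: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    history: List[Tuple[str, Dict[str, Any]]] = field(default_factory=list)

    def register(self, name: str, func: Callable[..., Any]) -> None:
        self.actions[name] = func

    def call(self, name: str, state: Any, **params: Any) -> Any:
        # A GUI click and a script line would both end up here,
        # so the audit trail does not care where the call came from.
        self.history.append((name, dict(params)))
        return self.actions[name](state, **params)

    def replay(self, state: Any, start: int = 0, stop: Optional[int] = None) -> Any:
        # Re-apply a portion of the history to a new starting state
        # without appending to the log again.
        for name, params in self.history[start:stop]:
            state = self.actions[name](state, **params)
        return state


log = ActionLog()
# Hypothetical actions operating on an image size given as (width, height).
log.register("crop", lambda size, width, height: (min(size[0], width), min(size[1], height)))
log.register("scale", lambda size, factor: (size[0] * factor, size[1] * factor))

size = log.call("crop", (800, 600), width=400, height=300)
size = log.call("scale", size, factor=2)

# Replay the whole history on another document, or only the last step.
replayed_all = log.replay((1000, 1000))
replayed_last = log.replay((1000, 1000), start=1)
```

Because the parameters are stored as plain data, editing a logged value (the selection box size, say) is just a dictionary update before replaying.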

Recently I made a Photoshop macro to crop the borders of some documents in a standard way for our graphic designers.

It took me 5 minutes to produce a working proof of concept with an approximate workflow...

But it took me 2 more hours to figure out how to change the numerical parameters to set the initial selection box precisely instead of defining it imprecisely by hand. The most frustrating part was that the GUI displayed these parameters perfectly but offered no f..ing way to edit them. Exporting and re-importing the macro as plain text for editing was not an easy option because of the proprietary binary format.

GUIs are nice, but when one obfuscates scripting capabilities it’s just another way to bind you to a platform by making you learn platform-specific skills to work around limitations instead of learning universal coding principles.

There are good GUIs around, however. Just have a look at the QGIS project, for instance: it can be used purely as a GUI but offers a lot of opportunities to input custom formulas when needed for small adjustments. Heavy scripting extensibility is also possible but more hidden from the basic user.

(NB: for Photoshop macro, I wasn’t using the latest CC release so I don’t know if it’s still the case)

Sure, but I think your example shows that even a GUI that produces the nicest possible text file for an exported macro is going to leave one with two activities: using the GUI to accomplish the task, and editing the output macro to create a batch for a repeated activity.

The command line allows these two activities to be closer to one activity, so when you base your skill set on using the command line, you get both things and can switch easily between them. It seems clear that for automating little tasks, this is kind of necessary.

Have a look at QGIS if you can. For me they really achieved the perfect balance. Some tasks are a hundred times easier with a GUI, like refining a layout and trying out graphical styles. But you can use variables and formulas pretty much everywhere to override manual control using data-defined attributes.

And when you need to, you can run the application without the GUI for heavy scripting. Still, the two approaches are fully compatible, so you can easily define a layout with the GUI and reuse it via the CLI, for instance.

Maybe it’s common and I’m just a goof, but this software workflow really impresses me.

It doesn't have to actually be CLI. It does need to be repeatable (by you) and reproducible (by others), and it does need to go through a known API.

Those requirements will tend to drive you towards a CLI solution as the easiest/best available, but if you could get those requirements satisfied in any other programmatic way, then I think you'd probably be okay.

But then the problem becomes, can you do programming from the GUI? I think that's actually a lot harder to do.

> Those requirements will tend to drive you towards a CLI solution as the easiest/best available, but if you could get those requirements satisfied in any other programmatic way, then I think you'd probably be okay.

Theoretically you can do these some other way. In practice, the CLI is the only way that's remained. The thing with the CLI is that it is quite easy to add a new component to it, whereas someone creating a little app for a little task as a GUI tends to create a "cul-de-sac", a stovepipe, a program with no relation to any other program.

So a third approach would be great, but it doesn't seem to be getting any closer.

Programs like Nuke, Houdini and Touch Designer have been doing this for years specific to graphics - this is trying to be more generalized:


SikuliX is halfway between CLI and GUI: https://avleonov.com/2017/02/13/sikulix-the-last-chance-for-...

This is basically what I kinda like about STATA. It's a Data Science tool for researchers.

The thing is that all its menu commands are stored in history, in ".do" file that you can send to anyone to reproduce your steps.

You can also just use the CL "shortcut" to address these same commands. It's pretty nifty.

There is also a package for R called rcmdr that provides GUI for basic statistical analysis. All your actions, performed in GUI, are stored as an actual .R script. However, I don't personally use it so I can't elaborate more.

> I thought the argument that a command line gives you reproducibility in a way that a GUI doesn't was good.

I fundamentally disagree with that. I work mainly with geo-data analysis and use programming, command line tools and GUI tools. And honestly, setting up a data processing pipeline in a GUI like FME is much easier and more reproducible out of the box than whatever happens to be left over after I've been screwing around with a bunch of random command line tools for a couple of hours. The main thing you lose is some flexibility.

I did some remote desktop support work in the early 90's, around the time that everybody started shifting from command-prompt DOS to GUI windows. We went from saying "type in ipconfig and tell me what you see" to "ok, do you see a start button? Click on the start button. Do you see where it says 'control panel'? It's about two-thirds of the way up. Yes, click on that. With the left button. No, the left button. Ok, do you see something that says 'network'? Click on that." It became obvious that it was impossible to do this over the phone, so we ended up installing what would probably be called "spyware" today so that we could remote-access everybody's machines (I can't remember what the software was called, but it rarely worked as it was supposed to). I see the same problem surrounding documentation around GUIs vs. command-line/text-oriented tools; the documentation spends pages to describe what a simple command would spend one line to cover. Although there are some things that do make some sense in a graphical interface, I still believe that _most_ things are orders of magnitude more efficient if you strip away the graphics and boil them down to some minimal commands.

Whenever I start a new software project, one of the first things I insist on is that the design include a command line tool with the same functionality as the GUI. This usually begets a sensible API between the business logic and the shiny parts, which ends up allowing for a great deal of flexibility for the GUI.
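A minimal sketch of that layering, assuming a toy "summarize some numbers" feature: the business logic lives in one plain function, and the command line front end is a thin wrapper around it. `summarize` and `main` are hypothetical names for illustration; a GUI front end would call `summarize` in the same way.

```python
import argparse
import statistics
from typing import Dict, List, Optional


def summarize(values: List[float]) -> Dict[str, float]:
    """Business logic: knows nothing about how it is invoked."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "max": max(values),
    }


def main(argv: Optional[List[str]] = None) -> Dict[str, float]:
    """Command line front end: parse arguments, call the API, print."""
    parser = argparse.ArgumentParser(description="Summarize a list of numbers.")
    parser.add_argument("values", nargs="+", type=float)
    args = parser.parse_args(argv)

    result = summarize(args.values)
    for key, value in result.items():
        print(f"{key}: {value}")
    return result
```

Calling `main(["1", "2", "3"])` from a test, or wiring `main()` to a script entry point, exercises the same function a button handler would call, which is what keeps the UI replaceable.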

I’ve been burned too often by the GUI first, top-down approach, where the software architecture evolves from the GUI. V1 gets shipped, then the UX designer gets bored and v2 has to have a totally different look and feel and you’re screwed because your software design is permanently tied to the original UI.

Text is a way to build an AST.

GUIs are a way to build an AST.

You can add, remove, transform those changes, and track that history. Music software, Unreal Engine's blueprints, and other node-based programming environments have been doing this, in some cases, for decades.
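To make the "two ways to build an AST" point concrete, here is a small sketch using Python's standard `ast` module. The first tree comes from parsing text; the second is assembled node by node, roughly what a node-based editor could do when a user wires an "add" node to two constants. The `evaluate` helper is a made-up name for this example.

```python
import ast

# Route 1: text. Parsing source code yields an AST.
from_text = ast.parse("1 + 2", mode="eval")

# Route 2: direct construction, with no source text involved.
from_nodes = ast.Expression(
    body=ast.BinOp(
        left=ast.Constant(value=1),
        op=ast.Add(),
        right=ast.Constant(value=2),
    )
)
# Hand-built nodes lack line/column info, which compile() requires.
ast.fix_missing_locations(from_nodes)


def evaluate(tree: ast.Expression):
    """Compile an expression tree and run it."""
    return eval(compile(tree, "<ast>", "eval"))


print(evaluate(from_text), evaluate(from_nodes))
```

Both trees compile and evaluate to the same value, so from the interpreter's point of view it does not matter which front end produced them.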

Saying that 'you can't do data science in a GUI' explains more about the speaker's lack of understanding of GUIs than it does about their knowledge of data science.

Depending on the tools, it also gives you composability that you don't get from GUI programs. GUIs don't have an equivalent of the pipe, so they aim to let the user do as much as possible within a single tool.

Depends on your definition of "Gui". See my other comment for an example of "science" in Smalltalk.

Doesn't Photoshop have a macro system for reproducible commands, though? I'm not an expert user, but I've definitely seen they have some form of automation available.

And a history API: with a few settings, and by keeping the PSD file (not a flattened JPEG or similar export), Photoshop gives information similar to a series of high-frequency VCS commits.

Granted, since Photoshop itself is closed source (and on a subscription model), there are some very strong limits to scientific replication of a process.

But one could do something similar with gimp, additionally aided by python scripts.

So yeah, Photoshop bad; cli good isn't as clever as all that as a blanket statement (not implying anyone said exactly that; just making an observation).

I see the talk/article is about "data" science, but the headline reminded me of an Alan Kay talk about teaching, where there's a clip of kids filming a fallen object and then juxtaposing the video with a rendered sequence based on (v=at etc). The whole video is worth watching, but see the "16:17" link in the transcript ("Now, what if we want to look at this more closely?"):


Yes, Photoshop has macros (they call them “Actions”).

These days, Photoshop even has (multiple) JavaScript interpreters (including an instance of node.js they call “Generator”) built in to enable automation using an API that provides access to most (but not all) actions that can be executed via the GUI.

I have been working with these for a side project and the documentation isn’t great, but I have been able to get up and running relatively quickly.


That's definitely a problem for science in general (not necessarily "data science" although all science uses data). There are people who analyze biomedical science with R and those who use GUI tools like Prism and Partek. Not only are the GUI tools limited, but the results are basically unreproducible -- if somebody clicked on a button incorrectly or set some setting in a different way, it would be impossible to tell.

I had this discussion today, and that was my conclusion: repeatability is the great advantage of CLIs and text files over GUIs.

What nonsense is this? Loads of GUI tools have it built in

'save report settings'

This is a benefit of using Vim (versus GUI editors) - you have a full history of all the commands you did

What nonsense is this? Loads of GUI tools have it built in.

'save report settings'

It seems like about every five years or so, some higher-up in my organization pushes down some "enterprise" graphical logic builder program that's supposed to simplify the creation of "complex business rules" so that even non-programmers can maintain them. Inevitably they end up being some variation of a drag-and-drop logic builder where something that resembles a flowchart is created by dragging if statements and loop constructs out of a tool palette onto a blank canvas. Of course, this is presented as "revolutionary" every time (even though I've seen the same thing at least five times already), costs a fortune, and turns out to be worse than useless - what the "non-programmers" are able to produce using this thing is limited to what can fit on a single screen (the thing slows down so much that it's unusable if you go bigger than a screen), and impossible to debug in any way. Yet in spite of failure after failure, I have every confidence that I'll end up dealing with more than one other graphical "business logic" tool in my career.

> some higher-up in my organization pushes down some "enterprise" graphical logic builder program... where something that resembles a flowchart is created

It should be said that the people who want to put this tool in place are the least likely to use any sort of flow chart when putting out requirements. In fact, I would say that the people who are most likely to buy this are also the most likely to use pantomime and Post-it notes to convey requirements.

As someone considering building a graphical builder of a kind you described - "dragging if statements and loop constructs out of a tool palette onto a blank canvas" - would you have any advice/suggestions on how to make it actually useful?

It's so true what you say about how this category of applications are often presented as revolutionary and prove to be limited, bloated, difficult to debug.. I'm thinking web page builders especially, but also various attempts at graphical programming. At the same time, there are in history some (more or less) successful examples, like HyperCard.

That is what happens when the logicbuildergui hits the commandlinefan

Is Excel not a GUI? Photoshop? Unity? You can get pretty damn far with those tools (if not all the way).

Can you? I have long been under the impression that Excel is to be avoided for all but the lightest, most cursory analyses.

From 2007:


From 2013:


I could post more but I would have to fire-up my old spreadsheet ;)

I opened your second link. The first issue is the classic floating point numbers have rounding errors problem, which as far as I know, every system suffers from. That isn't just an excel problem.

The problem is often related to floating point representation, that is true, but it's not correct to conclude "oh well, everything gets this wrong so I might as well use excel".

One issue with excel is that many of the built in functions and statistical measures are implemented in numerically naive ways (and presumably remain so for reasons of computation speed and backwards compatibility) so if you want to do robust analysis you have to avoid them entirely - at which point you are far better off with a language designed for this. This is particularly an issue with larger data sets, where accumulation errors can become acute. Excel also introduces additional error terms due to binary encoding.

By the way: it is misleading to think of "rounding error problems". Far better to think about it as "rounding properties"/"truncation properties" and the like, then realize that you can't (in general) write floating point operations as if they were utilizing real numbers and expect correct behavior. That doesn't mean correct behavior is not achievable.
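A small illustration of the accumulation point, as a sketch rather than a statement about how Excel computes anything: adding ten copies of 0.1 in a plain loop does not give exactly 1.0 in IEEE 754 double precision, while Python's `math.fsum`, which tracks the lost low-order bits, does. `naive_sum` is a hypothetical helper written out as an explicit loop so the behaviour does not depend on how the built-in `sum` is implemented.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation; each addition rounds to a double."""
    total = 0.0
    for v in values:
        total += v
    return total


values = [0.1] * 10

naive = naive_sum(values)
exact = math.fsum(values)

print(naive == 1.0)  # False: rounding errors accumulated
print(exact == 1.0)  # True: fsum returns the correctly rounded sum
```

The inputs are the same in both cases; only the summation algorithm differs, which is the sense in which correct behavior is achievable if the properties of floating point are taken into account.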

I think Excel is a bit of both. It is pretty amazing what people do with Excel. It's probably one of the most effective software tools ever created.

It straddles this incredible balance between completely free-form input and structured data enabling very powerful functionality.

There is a reason VisiCalc was such a big deal when it first came out. This is a UI that is both intuitive to regular people and also incredibly powerful -- and one that takes little effort to learn. It's a sign of the kinds of things that are possible with computing (reducing the gap between what today we call "coders" and "users"). There are few really large efforts towards research in this area these days and we are poorer for it.

Here is a good paper from Alan Kay that relates: https://frameworker.files.wordpress.com/2008/05/alan-kay-com...

But there could be errors hiding anywhere. All you know is it looks correct.

Sure, but the cost to fix that issue, and the flexibility you have to give up to do so, is apparently not worth it most of the time, if we look at how people use the software.

Perhaps there is a systematic undervaluing of more polished, robust custom solutions by excel users across the world. That could be the case.

But there is also probably an under-supply of adequate custom solutions. I have seen comments over the years on HN from people who have had much success in consulting gigs where they simply built custom tools to replace ad-hoc workflows and processes living in places like Excel.

All existing business processes tolerate high error rates. They have to, because anything that people do by hand will necessarily have a high error rate. So when programmers come to automate an existing business process, they often vastly overestimate the value of correctness at least in the short term: if the program does the wrong thing even 1% of the time, it's really not a problem, because the processes around this process will be built to tolerate that.

In the long term, correctness may become more valuable. An analogy: when factories first switched to electric power, they simply connected electric motors to the existing driveshafts used to distribute steam power around the factory, and only realised small improvements in productivity that way. But once factories were more fully converted to electric power, it became possible to rearrange machines to suit workflows (rather than being arranged around the driveshafts) and this led to much bigger productivity gains.

> Sure, but the cost to fix that issue, and the flexibility you have to give up to do so, is apparently not worth it most of the time, if we look at how people use the software.

That's not true. People just don't know any better. Look around you and you'll see many people using the wrong tools. It doesn't mean they've made a rational decision to use those, it usually means they're not aware of a better alternative.

You don't get any better assurances on the cli.

What do you mean by "the cli"? Garbage in/garbage out always applies; that's not what I'm talking about. The point is errors in the code doing the processing. If it's written in a language, you can read and understand the entire thing. You cannot read and understand an entire spreadsheet.

>You cannot read and understand an entire spreadsheet.

Why wouldn't you?

Except if you mean the internal implementation (by MS). But nobody reads the implementation of Pandas or R either...

I'm not talking about reading the underlying implementation, although that is, of course, another advantage of Python and R. I'm talking about reading the project. What's your process for reading a spreadsheet? Reading every single cell?

That's a good point. In a block of 10,000 cells that should all have the same formula (offset by 1, say) how do you know that they all actually have that same formula? I'm not aware of a way to check that that doesn't involve writing custom VBA.
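One way to sketch such a check outside of VBA, in Python, under a stated assumption: the formulas have already been exported in R1C1 notation, where a relative formula filled down a column has identical text in every row. How the formulas get out of the workbook is left out here, and `find_outliers` and the sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (row, column)


def find_outliers(formulas: Dict[Cell, str]) -> List[Cell]:
    """Return cells whose R1C1 formula differs from the most common one."""
    if not formulas:
        return []
    expected, _ = Counter(formulas.values()).most_common(1)[0]
    return sorted(cell for cell, text in formulas.items() if text != expected)


# Made-up block: 10,000 rows in column 3 that should all multiply the two
# cells to their left, with one cell overwritten by a hard-coded factor.
column = {(row, 3): "=RC[-2]*RC[-1]" for row in range(1, 10001)}
column[(4711, 3)] = "=RC[-2]*1.05"

print(find_outliers(column))
```

The check treats the majority formula as the intended one, which is itself an assumption; a block where most cells are wrong would need a different reference.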

A spreadsheet doesn't offer you a linear story that you can read. It's a strange amalgamation of computation and program.

I like iTunes 10 as a GUI.

In fact, I like it so much that I'm hoping to rewrite it in JavaScript so it can run in the browser. I also don't like what Apple did with iTunes 11 and 12.

1. SQL queries are Smart Playlists.

2. Summary statistics are shown on the bottom, and apply to the selection (if there is one), or the playlist (if nothing is selected).

3. It's object-oriented data. This is a valid AppleScript command:

tell application "iTunes" to return name of first track whose artist contains "Blink 182"

4. A browser view along the top to quickly see Genre, Artist, Album.

5. Nested playlist folders, including smart playlists.

6. Support for other data types. I wrote a script to generate m3u files to add virtual "Radio stations" to iTunes. Clicking those triggers a PHP script that usually opens my browser with a URL, but could technically do anything else. (OK, this is a hack - I'll document it if anyone wants to know).

Excel is fine, but it doesn't have the hierarchical structure that iTunes is so good at. It also mixes the data and the instructions.

And let us not forget that Powerpoint is turing complete!


> Is Excel not a GUI?

Yes and Excel is horrible for Data Science. You have hidden data and to be honest almost every complex spreadsheet has at least 2 errors in it. https://www.forbes.com/sites/salesforce/2014/09/13/sorry-spr...

Excel? Not in the traditional sense. It's more of a... spatial(?) programming language.

The problem is that those limitations prevent you from being able to do any real "data science".

The GUI comes with a whole host of built-in assumptions about the data and what it could tell you, and you're probably not smart enough to know what those are and how to correct for them.

You might as well just search for your keys at the base of a streetlight.

The amount of money flowing through the system that's being optimized might be less than the cost of a completely general data scientist trained to work with completely generalizable tooling. Many businesses are happy to be able to spend $30/month and a small slice of the time of a resident spreadsheet master in order to find an incomplete but useful list of optimizations.

Try KNIME. You can do data science with a GUI and not be limited at all, especially if the GUI also lets you execute Python and R.

Also, Knime and similar apps show a workflow that is way easier to understand at first sight, especially for people with low programming skills.

Not to mention that you don't have to care about the environment, the quirks of the packages and so on. I recognize this is not really a problem in R, but I had trouble with it in Python from my noob perspective on programming.

I do pretty complicated stuff with Knime, PostgreSQL and Apache Nifi. I make workflows almost on the fly, snap many operations onto them, etc. I'd love to see these guys doing something similar in the same timeframe I do.

Well-made GUIs like Knime's make your life easier; this kind of article sounds a bit snobbish tbh. These statements are true for developing websites and many other fields, not really for treating data. It doesn't make much difference given the current tools.

Edit: I see people arguing Excel is fantastic. It's true, it's amazing, but it doesn't keep track of what you do with the data. It's difficult to understand what's going on if there's something even a bit complicated, not to mention if people just copy and paste stuff everywhere, which is common.

Sorry for my poor english.

I have used Knime for a few months, enough to get comfortable with it, and while some of what you say is true I have to disagree about it standing up to command line tools. Especially after having tried to go through some code written by others in it.

The main problem I have with it is in many cases the upstream data has to be available to a node in order to be allowed to open the node up and see the configuration. This has resulted in my spending a day or two updating and tweaking database queries on an old connection and fixing the resulting configuration errors just to get to one node and read the configuration. In this case I didn't care about the data itself, I just needed the config of that one node.

This would not have been necessary with a text-based tool, where I could just scroll down and read. Knime can be a powerful tool but for sharing work I think it's unnecessarily painful.

You are right about this. I suggest you report it in the repos or forum; they are somewhat responsive to it and it definitely would be a plus for everyone.

You also have Orange from Biolab. Although it is way less powerful, it has some cool nodes, and AFAIK it was possible to open up a node's config without connecting it to anything.

It's just more Gatekeeping BS. I'm sure they said the same thing about software i.e. "you have to code in command line to be a real programmer"

It's turtles all the way down. All users of computers must deal with the abstraction handles and layers with which they are comfortable, and which produce results. Not just data-science people. Pretty much a user maxim.

No, you've missed two of the major points: reproducibility and scrutability.

Actually, I really like these two GUIs which build on the strengths of R by combining the data-exploratory tools that Hadley Wickham created with the visual appeal of ggplot2 (also Hadley!), plot.ly, etc.:

1. https://exploratory.io/

2. Radiant: https://radiant-rstats.github.io/docs/index.html

Microsoft IDEAR looks good, too:


In addition, the automatic insights generated by Power BI are another example of how GUIs can help even the hardcore command-line ninja:


I love exploratory - it's in a really interesting place between GUI and code autocomplete on steroids.

I have not heard of the exploratory tools. I look forward to checking them out, thanks for the heads up.

I didn't watch the talk. So please let me know if I'm mistaken about this one particular point:

That the author seems to be warning of limitations of certain solutions, but generalising those limitations to GUIs as a whole.

This is wrong.

There is nothing inherent in a GUI that would make it unsuitable for coding. Code does not equal text. Code, can be represented in many ways. An AST is code. Text, in a certain syntax, would represent the same code. And so can a GUI.

Now, is there a GUI for general-purpose programming that I'd want to use today for production work? No.

Will there be one in the future? I believe so.

But people discounting coding GUIs left and right, just because they haven't seen a good example of it yet, only discourages others to explore it further. It's a self-fulfilling prophecy, to some degree.

Anyway, here is a text/GUI-based programming environment (single data / multiple representations) that you might want to play with: luna-lang.org

> But people discounting coding GUIs left and right, just because they haven't seen a good example of it yet, only discourages others to explore it further. It's a self-fulfilling prophecy, to some degree

And there are good examples from the past, too. History is important and is often ignored by the wider programming community. Personal computing was more or less invented in GUI programming systems. There are environments and architectures from the 1970s that are highly relevant today but that are ignored or downplayed in the mainstream. Such an attitude will stifle progress for sure.

> But people discounting coding GUIs left and right, just because they haven't seen a good example of it yet

It's not just because I haven't seen a good example, it's because we have 50 years of bad examples. When something has been tried and failed by so many people for such a length of time you have to at least consider the possibility that the idea is just a fundamentally bad one.

I think I'm more likely to see flying cars in my lifetime than a decent GUI for general purpose programming. The problem with both is that they are fundamentally flawed.

Arguably, modern IDEs are decent GUIs for general purpose programming. Compared to programming by editing files in a bash shell, they provide lots of visual tools (autocomplete, debugging) to track the dependencies between objects in the code, which is what a graphical interface provides over independent files.

Yes, healthy skepticism is good, but just because code (i.e. text files) is often the most _powerful_ or _flexible_ tool, doesn't mean it's always the best tool.

We (programmers) are notoriously bad at advancing the tools in our field.

For a brief history of this, watch "The Future of Programming" talk by Bret Victor:


>We (programmers) are notoriously bad at advancing the tools in our field.

I think the tools of our trade have advanced tremendously. Visual Studio for example is an amazing experience for C# programmers, one that most languages don't have. And this is in text tools.

Programmers know that text is the most powerful and flexible, which is why we advance those tools that help in working with text.

GUI tools are good for people who only want to do something every once in a while. Something they don't need to repeat. Where there's a simple recipe for it. And, yes, programmers don't do that much to advance these, because they have no use for them themselves.

That guy gets an a+ for presentation but I couldn't find much to agree with him on.

He talks about code being linear lines of text as though that's a bad thing. We've pretty much been stuck with this as state of the art in our writing systems for thousands of years, what would be your reaction if I suggested everyone should watch videos instead of read books? It's a flexible and easy way to represent a program that no other tool has come close to.

> We (programmers) are notoriously bad at advancing the tools in our field.

We've been trying to automate ourselves out of jobs for the entirety of the history of the industry yet programmers are in more demand than ever. Everyone wants to work on interesting problems and creating inner platforms is far more interesting than writing boring business logic. Yet for all our efforts we've barely progressed since the 70's, why do you think that is?

> What would be your reaction if I suggested everyone should watch videos instead of read books?

Videos are just another useful tool for learning; they don't obviate the need for books, but they're better at conveying some ideas/information than books alone.

Just like videos and books aren't mutually-exclusive tools for learning, graphical tools and textfiles aren't mutually-exclusive tools for building programs.

OK, I could write and execute SQL in a terminal. But I'd rather use MySQL Workbench. I get easy management of tables, views, etc. And flexible display of results. Why would you not want the GUI?

In the article, "GUI" seems to be used as an opposite to "programming". Analogy would be SQL versus query wizard dialogue.

I think you're right that the author is using "GUI" as the opposite of programming. I think he's correct with respect to nearly every GUI I've ever seen, but incorrect with respect to what a GUI could do in principle.

There's no reason that your wizard dialogue couldn't be exactly as expressive as SQL. In principle you could make a wizard that just built an arbitrary SQL query and ran it.
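As a rough sketch of what such a wizard could do behind the dialog, here is a Python function that turns structured choices (table, columns, filters) into a parameterized SQL query and runs it with the standard `sqlite3` module. It covers only a tiny subset of SQL, so it illustrates the direction rather than the "exactly as expressive" case; `build_query`, the `tracks` table and its rows are all made up for the example.

```python
import sqlite3
from typing import Any, List, Sequence, Tuple

ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}


def quote_identifier(name: str) -> str:
    # Only plain identifiers are accepted; values go through placeholders.
    if not name.isidentifier():
        raise ValueError(f"unsupported identifier: {name!r}")
    return f'"{name}"'


def build_query(
    table: str,
    columns: Sequence[str],
    filters: Sequence[Tuple[str, str, Any]],
) -> Tuple[str, List[Any]]:
    """Translate wizard choices into SQL text plus bound parameters."""
    sql = f"SELECT {', '.join(quote_identifier(c) for c in columns)} FROM {quote_identifier(table)}"
    clauses: List[str] = []
    params: List[Any] = []
    for column, op, value in filters:
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op!r}")
        clauses.append(f"{quote_identifier(column)} {op} ?")
        params.append(value)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Made-up data standing in for whatever the wizard is pointed at.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tracks (artist TEXT, plays INTEGER)")
conn.executemany("INSERT INTO tracks VALUES (?, ?)", [("a", 20), ("b", 5)])

# What the dialog would pass after the user picks a column and a filter.
sql, params = build_query("tracks", ["artist"], [("plays", ">", 10)])
rows = conn.execute(sql, params).fetchall()
print(sql, rows)
```

Since the wizard's output is ordinary SQL text, it can be saved, diffed, and rerun, which is the reproducibility property the thread keeps coming back to.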

I said "nearly" before, the reason is graphical programming languages, for example unreal engine's blueprints. These are a "gui", and allow general purpose programming. One could imagine a tool with a similar style of programming, that extends them with inline data visualization and other tools, that would both undoubtedly be a GUI, and have all the nice features of programming.

I think the better compromise is probably something like jupyter notebooks, but that doesn't mean a GUI couldn't do it. And maybe a better GUI exists that I just haven't managed to imagine.

OK, but consider MySQL Workbench. One types SQL in a query. Or loads a saved query. And there's an execute button. But unlike working in terminal, the SQL is still on screen after a failed run. And there's red markup pointing to errors.

That's the advantage of the GUI.

The environment used as an example in the article is RStudio which by your definition of GUI would be a GUI (i.e. if MySQL Workbench is a GUI, so is RStudio). I think you and the author aren't in disagreement per se, you're just using the term differently -- when he talks about a non-GUI, he means a tool which executes code that you type into it, and you mean a terminal.

OK, then call it "menu-driven" or whatever. There's nothing intrinsically GUI about menus.

I would use a Jupyter notebook for data science. I don't think it's a GUI in the sense that the author intended.

In that case, IDEs would also be GUIs.

Well, IDEs are GUIs, though also not in the sense that the author intended. The problem seems to be a lack of a good term to clearly describe the subset of GUIs that the author wants to talk about.

How about a RUI (Restricted User Interface).

Yes, but for anything non-trivial you still write the SQL, PL/SQL, T-SQL or what have you; it's just easier to do that using a GUI tool than directly on the command line.

> Why would you not want the GUI?

Elitist rhetoric and snobbery

He also suggests leveraging a programming language for benefits that include reproducibility, data provenance, and the ability to see how data analysis has evolved over time.

It's true that tracking data and data provenance is very important and hard in GUI tools.

But it's not really that much easier in code either.

To be specific about the kinds of problems here, I'm thinking of things like when errors are found in a dataset, new labels are introduced, or you want multiple splits on the data, but you still want the old version to check your metrics against.

Yes, you can do things like versioned directories for different data versions (although this tends to break when you are talking about TBs of data).

Or you can try using traditional version control tools, but that involves switching between code and your version control tool.

Or you can try transformation orientated programming, where you keep the original version of the data and then always transform it to get to the new version. This is slow on large data and fails when new information is introduced.
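A minimal sketch of that transformation-oriented approach, in Python: the original records are never modified, and each dataset version is defined as an ordered list of transformation functions replayed from the original. Every name here (`ORIGINAL`, `VERSIONS`, `materialize`, the toy records and labels) is made up for illustration.

```python
# The original data is kept immutable; a tuple of tuples stands in for a real dataset.
ORIGINAL = (
    ("img_001", "cat"),
    ("img_002", "dgo"),  # labelling error, discovered after v1 was published
    ("img_003", "dog"),
)

def fix_typo_labels(rows):
    """Correct the known labelling error without touching the original."""
    return [(name, "dog" if label == "dgo" else label) for name, label in rows]

def rename_labels(rows):
    """Switch to a new label vocabulary (hypothetical relabelling)."""
    mapping = {"cat": "feline", "dog": "canine"}
    return [(name, mapping.get(label, label)) for name, label in rows]

# Each version is just the list of transformations applied, in order, to ORIGINAL.
VERSIONS = {
    "v1": [],
    "v2": [fix_typo_labels],
    "v3": [fix_typo_labels, rename_labels],
}

def materialize(version):
    """Rebuild a dataset version by replaying its transformations from the original."""
    rows = list(ORIGINAL)
    for transform in VERSIONS[version]:
        rows = transform(rows)
    return rows
```

Every version stays available side by side, so old metrics can still be checked against `materialize("v1")`. The cost is that each call replays the whole chain from the original, which is the slowness on large data.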

Also, normal version control doesn't work well with modelling code, because you want to use both the old and new versions of the code simultaneously.

Greg Brockman talked about this exact problem in the OpenAI/Y Combinator podcast.

This is a hard problem to solve - not sure who is working on it.

I’m not quite sure what he’s thinking of as a “GUI” in this context. What is RStudio? Are there actually people that use “GUIs” for data science? Seems to me everyone is using R or Jupyter notebooks, or plain scripting.

He's talking about programs where you do "work" by point and click and then hit save. RStudio is just an IDE with a graphical terminal. All the "work" must be done in code.

Yes, Tableau is used a lot in life sciences.

Tableau is the biggest. But there are hundreds of other BI tools that are being pushed hard by vendors on the premise that they can replace code.

This appeals to managers because it’s hard to recruit people who know R and SAS.

Some of these tools are better than others, but none let you do everything you can do in R.

Exactly. The way I understood it, if you’re interfacing with a computer monitor instead of shuffling around magnetic bits on the disk by sending electric impulses directly to the write head, you’re using a graphical interface, are you not?

Pretty sure that's not how most people define GUI. You seem to be defining an operating system.

Nah. Technically a graphical interface is an interface that is facilitated through the use of graphics. Not all computer interfaces are graphical. Some interfaces are audio/voice (such as Alexa and Siri, Screen readers for the blind). Etc.

I think the point I was making is that “what is a GUI” is not objective. And its colloquial definition may also evolve over time. The fundamental question right now is, is text graphical? Some would say no. I’d say yes. What about code highlighting? Code hinting? The buttons on your text editor? Aren’t those graphical? Of course they are. I’d argue that they are the best GUI for crafting custom computer instructions.

In a sense, but GUI has always meant some kind of windowing/icon based form of navigation and running commands from menus or other mouse-based events, as opposed to using some kind of shell where you're typing in text-based commands.

An IDE would straddle both worlds.

You can use a GUI for the following:

1. Making a video game of a particular type (racing game, shoot 'em up, 3D shooter etc)

2. Doing an analysis of a particular type (regression, ARMA analysis)

3. Making a web page of a particular type (landing page, agency page, personal blog)

What you cannot do is use a GUI to do a new kind of thing that the maker of the GUI had not considered.

And that is why you can't use it to do any kind of "science" which involves experimenting to see if this new technique will work.

Phooey. This comes off as unnecessary gate-keeping. How does anyone do science, ever? They use tools for measurement, tools for analysis, tools for record keeping...all of these functions can be captured within a GUI. Doing science doesn't ever require you to be doing something "new" or innovative... in fact, most of the time you are applying age-old techniques. The only thing new is the problem, and that's not dependent on what types of tools you are using to solve it.

I think you misunderstood me.

You can, of course, do science with graphical tools.

For instance, Excel is a "graphical tool", and a lot of scientists use it.

But you can't do science on the field the tool is solving.

So you can use Excel for biology, but data science is trying to experiment on the data technique itself.

Practically speaking, this is true. Graphical environments typically snap together predefined (or at least minimally configurable) modules of code, Lego-style.

Theoretically speaking, of course, all Turing-complete languages are equivalent. You merely need a graphical environment that is expressive enough, i.e. one that can do conditional branching and store items in memory.
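As a toy illustration of those two ingredients, here is a sketch in Python of a "boxes and arrows" program stored as a node graph, roughly as a visual editor might serialise it, plus a tiny interpreter for it. The node kinds (`add`, `branch`, `halt`) and the `run_graph` function are invented for this example, and it is in no way a proof of Turing-completeness, just a demonstration that branching plus memory already gives you loops.

```python
def run_graph(nodes, start, memory):
    """Follow arrows from node to node until a 'halt' node is reached."""
    current = start
    while True:
        node = nodes[current]
        kind = node["kind"]
        if kind == "halt":
            return memory
        if kind == "add":
            # Memory: add one stored value onto another.
            memory[node["target"]] += memory[node["source"]]
            current = node["next"]
        elif kind == "branch":
            # Conditional branching: pick one of two outgoing arrows.
            if memory[node["var"]] > 0:
                current = node["if_positive"]
            else:
                current = node["otherwise"]
        else:
            raise ValueError(f"unknown node kind: {kind}")

# A four-box program that sums n + (n-1) + ... + 1 into 'total'.
SUM_TO_N = {
    "check": {"kind": "branch", "var": "n",
              "if_positive": "accumulate", "otherwise": "done"},
    "accumulate": {"kind": "add", "target": "total", "source": "n",
                   "next": "decrement"},
    "decrement": {"kind": "add", "target": "n", "source": "minus_one",
                  "next": "check"},
    "done": {"kind": "halt"},
}

result = run_graph(SUM_TO_N, "check", {"n": 4, "total": 0, "minus_one": -1})
print(result["total"])
```

Nothing in the interpreter cares whether the graph was typed as a dictionary or drawn with a mouse. The practical question is whether a given tool lets you build arbitrary graphs like this or only snap together predefined modules.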

Well, I have one issue/question about this statement.

Turing-completeness would be a measure of computational equivalence, but that -- and I don't know if this is the right way to phrase it -- that is not the same as "presentational" equivalence.

A modern computer program does computations, but also displays them a certain way, and TMs are silent on how you display the results of the computation. (To drive the point home, under no circumstances could Turing's ticker tape machine ever light up a single pixel on a screen.)

So, if you had an infinitely fast actual Turing Machine with infinite memory, you could tell me where the video game should go, and how the web page should render and what the regression visualization would look like, but not actually show it to me.

Ordinarily this is a pedantic argument to make but in the case of a GUI it is not. The GUI has to actually encode literally an infinite number of options to display the data in addition to computing it. This is particularly important for video games for instance.

So, I think you could have a theoretically complete Turing Machine and still not do the things that a modern computing language does.

Maybe the title should have been I can't do Data Science in a GUI. There is value in making high-level tools more accessible (and in the process gutting them of some of their power). But maybe not for this author.

Maybe you could do a "I can do data science in a GUI" talk and amaze us all (addressing all points the OP made). Hadley has been working for years on making data science programming with R accessible: development of expressive domain-specific APIs, books freely available online, etc.

If the GUI had a terminal built in, then yes you can. The mistake is to get rid of the terminal altogether instead of trying to augment it.

No, like other commenters you miss the points about reproducibility and scrutability. This isn't just about the tool being efficient and precise for the user, it's about tracking exactly what you've done in your process and nothing beats code for that.

Using a GUI doesn’t mean there is no code. SAS JMP lets you work on your data visually yet you still have code that defines your tables and plots, and you can easily work with git or hg.

JMP was precisely my counterargument to this whole piece. JMP is an excellent data science tool when used correctly and allows for reproducibility. You can even call R from JMP if you need to do something that is easier in R.

Further, as others have commented, exploratory data analysis is much easier in an environment like JMP than it is in R - when you're just playing around with the data and trying to get a sense of it, it's much easier to make quick approximate graphs in a GUI than in a command line.

Also Power BI's Power Query, Excel macro recorder, graphical SQL tools, etc

Yeah you're missing my point, just cause you're programming in a browser doesn't mean you can't have access to GNU toolchain, VIM / Emacs, git, ...

Agree. That's what makes JupyterLab in the cloud very interesting.

One amazing GUI/visualization tool I use every day for investigating computational fluid dynamics simulations is ParaView[0]. At the very least, it's an open-source GUI built on top of VTK used for visualizing CFD results, developed by the incredible developers at Kitware (same company that develops CMake). Under the hood, it allows you to separate your client from a {data,rendering} server so that you can visualize large datasets from a small laptop utilizing a client/server model.

Besides just data investigation using the GUI, all visualization workflows can be automated either through their python wrapper, or directly through the C++ API. The fact that you can automate the slicing/dicing and post-processing of CFD results on a server remotely using Python still blows my mind.

I wouldn't normally associate CFD with data science, but in a way, analyzing CFD results is starting to require the kind of scale of big data, and can certainly be done with the help of a GUI.

[0] www.paraview.org

The talk title is quite provocative but the material discussed not so much. It is true that core data science is iterative, needs to be reproducible and, more recently, explainable. Can you do everything in a GUI? It really depends on where you see it from. Early data scientists were programmers before, so they loved to code and built tooling around it. While code is still dominant we also see the rise of UI-centric tools - these allow you to build ML pipelines by snapping blocks together. I feel they are chasing a "different" type of data scientist. The term data scientist itself has become quite broad.

Code and the CLI give data scientists infinite flexibility, but setup, management etc. are a challenge. A GUI provides much less flexibility, but you will have output fast - works for simpler problems IMHO.

GUI or not data science has to move to cloud-based tools. Whether you write code in a browser or a CLI on a local machine is a matter of choice.

> GUI or not data science has to move to cloud-based tools.

This is a pretty odd statement. Care to explain why?

Smaller datasets will work on a laptop/desktop. For DL work with large datasets you need to build a GPU workstation. Moreover, setting up environment and dependencies on different hardware setups is not straightforward.

Cloud provides the flexibility of choosing the hardware, and many open source projects allow you to manage your dependencies and setup better. From 0 to something, cloud is better than custom.

For quite a few companies their whole "big data" dataset is small enough to be processed by a single beefy machine. In that case it's far more cost effective to simply plug in $1000 worth of extra RAM in a workstation rather than spend some extra engineer-hours to do it remotely.

Couldn't agree more on the data size. In most cases a beefy machine works. Would on-demand (cloud) make it simpler?

Also, a beefy machine works for training jobs. But we need to deploy the models too.

With the many container solutions available today it's incredibly easy to move from dev to prod. You don't need to pay for a prod environment to do your development just to avoid ever having to migrate.

Yes, it is incredibly easy, except when you upgrade to TensorFlow 1.6 and it fails with [a cuda error][1] and after a couple of sleepless nights you realize NVIDIA has deleted the Docker image of CUDA version 70xx from Docker Hub and you need to find the right commit that works from their git repo and build everything yourself.

[1]: https://github.com/tensorflow/tensorflow/issues/17566

So if you struggle with "setting up environment and dependencies on different hardware setups" then maybe data science and technical programming is not for you.

I wish this view was held more widely. Unfortunately the vast majority of "data science" jobs out there will not ask for any such technical ability.


Here's a version of the video with better sound.

I never understood why people record in stereo when all that is happening is people talking. Stereo is awful for that.

Hmm. I can, and do okay by it. But I do enjoy using the command line and remain unimpressed with many of R's more "unique"* packages not found in recent Anaconda distributions or in Julia, so perhaps I'm not this blog's target audience.

* R packages with low use tend to have highly variable quality. YMMV

The criticisms of the talk are valid, but overlook probably the biggest counterexample to the (cool) tools presented here:

Tableau.

It's dominating many areas of data analysis, yet the terms of service prohibit any scripting or API access to the formation of the XML that composes the .twb workbooks. The user is limited to GUI clicking, none of which can be stored.

So I think it's fair to say "You Can't Do Data Science in Tableau," even though it's a useful tool and well implemented as a reporting tool. To me it seems a bit of a trap to build proficiency with Tableau and get locked into a closed enterprise product; I'm seeing a lot of data engineers at my company getting sucked into what is essentially a "tableau developer" role.

Data Scientists need to understand that speed trumps statistical rigor for basically 95% of the questions people have. GUIs bring speed.

Yes, there is something fundamentally limiting about GUIs that I struggle to express in a comprehensive way to people.

You see it with almost any task. I see it when working with IDEs, e.g. you have complex project management and there is always something you need to do which the designers of the IDE never had in mind.

However GUIs make a lot of things easier to do, so I think the ultimate solution is always something that allows you to mix and match GUI and programming/plain text solutions.

I think the problem today is that we make these GUI behemoth tools, when it would have been better with a collection of much smaller tools with more dedicated usage.

Knime is, maybe, what you are looking for. Sad the server version isn't open source AFAIK.

Any thoughts on the R vs Python at 2/3rd of the page? I have the feeling (but no data) that the data science community is moving from R to Python, is that correct?

Yes. R is horrible.

R needlessly flouts convention. Find a function that has the same name in NumPy, Matlab, and Julia, and that function will have an infuriatingly different name in R.

I personally love R and think it is superior to Python for about 90% of data science related tasks.

That's the thing. R is excellent at the statistics (the science, the hypothesis testing, and the other tools - and CRAN has real gems), but completely unusable for data manipulation (cleaning, filtering, discovering/exploration, tinkering), and especially hostile for anyone who is used to real programming languages. (No data pipeline for you, no easy scripting for regular experiments/backtesting. No great APIs, etc.)

> completely unusable for data manipulation (cleaning, filtering, discovering/exploration, tinkering)

Surely you aren't serious. Data manipulation is one of the areas in which R is vastly superior to Python.

When I started my plotting app, Veusz, https://veusz.github.io/, I expected I would mostly be using a command line to drive it. As I improved the GUI, however, I found I was very rarely using the command line. GUIs can be very good for exploratory investigation and plot manipulation. Veusz is scriptable, so you get the best of both worlds.

I agree with him completely, but I also know that most 'Data Scientists' are just people who know a bit of Excel and have a head for numbers.

I think we just need to be realistic: if your job is to collate sales results every month and create a presentation about something interesting in the data, you aren't really a scientist and you can probably just use Excel.

I have run into "Data Scientists" in the past that use exclusively Excel and maybe some outstandingly awful SQL.

To be fair, any job involving data gets title-inflated to data scientist these days. It may not be their fault!

I have too, and said "sorry" when they asked me how to deal with a 30k+ line Excel file, as my only answer to them was "learn to program", and that didn't feel like a very helpful thing to say (I still did say it, in the least condescending way I could manage).

Building prototypes in a GUI like RapidMiner is substantially faster than in R or Python

I don't understand the downvotes. It could be true for him, while it could be the opposite for others. It's all about your perspective. I have said it before, but I'll say it again:

Now imagine an alternate universe where tools like Photoshop/Illustrator were never invented (no GUI or mouse-based operating environments). We would still have created art through the command line. Perhaps it would be a sophisticated version of SVG, mostly produced by people with development skills, with designers occasionally checking the design and giving input. This process would have decades of tooling to make things better, along with the same arguments we have about repeatability through tools like macros. Now imagine someone coming up with the idea of a basic version of Photoshop: we would most probably dismiss it. Very few mainstream programmers would adopt such a tool (enterprises would, as seen with RapidMiner). That doesn't mean it couldn't one day evolve into the Photoshop that we see today and totally eliminate the development effort in producing art.

p.s. edited for typos.

> GUIs deliberately obfuscate the process as you can only do what the GUI inventors want you to do.

One of the problems with programming languages that begin with the letter "p" is that they can be used to create viruses.

I was expecting Quantrix

Wow. And what about https://beta.observablehq.com/ by Mike Bostock?

"You Can't Do Data Science Effectively in a GUI" would have been a more honest title, as that is the crux of the speaker's talk.

Thanks for the links to videos; the very light gray very thin letters on white skate the edge of legibility.

Only having read the summary, I wonder if he considers a digital notebook (like Jupyter) to be a GUI.

Notebook is code.

Hadley is one of the driving forces behind R Notebook.

SPSS or Stata are GUIs that help to write reproducible code... are we really arguing about that in 2018? Why so much élitist crap?

And btw - more general question - why did computer scientists of the 70s (SPSS is from 1968) have much better intuitions about what users needed than computer scientists of today?

True, but you cannot do data science without a GUI.

You can do anything with the right GUI... So what he means is that there is no proper GUI available?

The article doesn't seem to address the question in the title. Why would they be mutually exclusive?

The submitted article was https://blog.dominodatalab.com/data-scientist-programmer-mut... with the title "Data Scientist? Programmer? Are They Mutually Exclusive?"

Since it summarizes a talk, we changed the URL to that of the talk, in keeping with the HN guidelines' request for original sources (https://news.ycombinator.com/newsguidelines.html).

So, it really isn't an improvement.

Anyone can skim the blog and get the gist of a 75 minute talk-a-thon in seconds, and move right along.

The presentation is just beating a dead horse about text expressions edited in a text editor and/or executed at the command line, and interpreted by R, vs. button clicking in R, in a nice IDE that's comfortable and approachable for non-programmers.

E.G. le tooling debate du jour, oui oui, monsieur...

OK, we'll change back to the text post but keep the title from the original talk. Maybe that will satisfy everybody...

In the video at the very start he mentions that the title of the talk is provocative, but when it's used as a title for posting on a site like this, it can obviously backfire a bit.

In the Q&A at the end, the very last question asked (regarding visual pipeline GUIs or something like that) was the content probably most directly related to the title.

What I think is interesting is that even though he's largely correct, in the real world the ease of use of approachable UIs, even if built on top of objectively substandard application cores, very often wins the race, at least until the next "shift" occurs. This has saved Microsoft's bacon more than once in history.

Did you watch all of Hadley's video? You might get the title more if you saw/see the whole talk :)

Also, the article does state what Hadley's take on the question is: "As Wickham defines data science as “the process by which data becomes understanding, knowledge, and insight”, he advocates using data science tools where value is gained from iteration, surprise, reproducibility, and scalability. In particular, he argues that being a data scientist and being programmer are not mutually exclusive and that using a programming language helps data scientists towards understanding the real signal within their data. "

thank you!

Cool, I'll just watch a 75 minute video so I understand a poorly written title, that's reasonable.

It's a bit baitey.

I bet the female/male ratio is higher for users of Excel than for R or Python. A majority of programmers are men, but the sex ratio of white collar workers (a large fraction of whom use spreadsheets) is much closer to parity.

There is often discussion about how to have more women in field X. Opening up field X to non-programmers is one way to do so.

I was once told not to change my source code purely to make my unit test pass. Changing the field to push the ratio towards 50/50 seems similar. I think the real question is: how do we get more females to want to learn to program?

100% this. If people want to push diversity initiatives, go for it, but I think there's no way that leads to real change without starting from the ground up i.e. cultivating childhood interest. Teaching about important women in CS helps break down the cultural expectation of programming as a male thing. In addition, although this is just speculation, I imagine workshops where the participants work to build something practical rather than purely "fun" would appeal to girls. A note-taking or organization or journal or instant messaging application would be great. Something you could use instead of pandering to stereotypes. Maybe I'm wrong about the workshop thing, but I stand by the childhood interest part.

Why is working with spreadsheets not considered programming?

This is really stupid. You can do anything in a GUI.

In general, you can do in a GUI what the programmer planned to enable you to do. As was mentioned previously in this thread, programmers come up with some sort of "graphical programming" frequently, recreating the https://en.wikipedia.org/wiki/Inner-platform_effect -- Which brings us back to the plain text language of programming and processing where all this is already available.

The deeper abstractions are lost on you, young padawan.

If this is criticism, I don't understand it.

Yes, yes, downvote me, let the hate flow through you.

Doesn't change that you can do anything in a gui :)

There's no such thing as data science. It's applied statistics. Why invent a silly name?

As evident in the comments, it means different things to different people, from straight up Excel to R etc.

IMHO, the best data science people I've dealt with have supreme knowledge of statistics and math, combined with the ability to dive deep into python, R, Tableau, viz tools, d3, whatever is needed to get the job done. They generally rely on data engineers to provide sane data, but can jump in as needed and write tools (or at least give guidance) to properly clean up the data themselves. In this context, the GUI vs. non-GUI seems kind of irrelevant.

Because "The person you hire into your business to be the crazy mad scientist in the corner who helps with decision making, business development, analysis, and occasionally reporting" is too long for most standard business cards and even most respectable email signatures.

I'd love to have a bcard with the official job title "mad german scientist". that would be groceartig!
