Transform Data by Example [video] (microsoft.com)
363 points by gggggggg on May 17, 2017 | 87 comments

You know what this reminds me of? Those trained neural-net things which, however many training examples you give it, always seem to find some way to “cheat” and not do what you want while still obeying all your training data correctly.

Something like this: Suppose we have a table of strings of digits, some including spaces, and we’d like to remove the spaces. From

  123 456
  345 678

Now, what happens if it encounters, say,

  4567890

Would the result be unchanged (as we would probably want), or would it “cheat” and remove the middle “7” character, giving “456890”?

This is why I want any ML device to be able to explain itself. It could train on your before-and-after examples and come up with a list of what it thinks you want it to do.

For your example, it could list:

    “Remove interior spaces from each item”
or it could say:

    “Remove the middle character from any 7-character strings to make them 6 characters in length”
You would be able to do something with that.
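To make the ambiguity concrete, here is a hypothetical sketch of two candidate programs that both satisfy the training pairs yet disagree on new input:

```python
def remove_spaces(s):
    # Candidate 1: "Remove interior spaces from each item"
    return s.replace(" ", "")

def drop_middle_of_seven(s):
    # Candidate 2: "Remove the middle character from any 7-character string"
    return s[:3] + s[4:] if len(s) == 7 else s

examples = [("123 456", "123456"), ("345 678", "345678")]

# Both programs are consistent with every training example...
assert all(remove_spaces(x) == y and drop_middle_of_seven(x) == y
           for x, y in examples)

# ...but they diverge on unseen data:
print(remove_spaces("4567890"))         # unchanged, as we'd probably want
print(drop_middle_of_seven("4567890"))  # the "cheat": drops the middle digit
```

With only two examples, nothing in the training data distinguishes the two rules.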

DataWrangler [0] (now productionized as Trifacta Wrangler [1]) does pretty much that. It gives you suggested lists of transformations such as "Cut from position 18-25 as the Year column", that you can chain together as your data cleaning pipeline.

[0]: http://vis.stanford.edu/wrangler/

[1]: https://www.trifacta.com/products/wrangler/

This add-on already does that. (Did nobody try it??) It shows in the pane a list of candidate transforms, seemingly ranked in some descending plausibility order. They have semi-readable names. You get to choose one to apply.

Neural nets are infamous for not doing this.

Learning algorithms that produce decision trees are usually used in this situation.

This might be a dumb question, but let's say that for whatever reason on a specific problem it's much easier to train a neural network that generalizes well than a decision tree. Why not train the network, then build an equivalent decision tree that just tries to reproduce the network's output? When building the tree from the network, overfitting would not be a concern. In fact, you'd want it to overfit.

You could even say that it only needs to approximately reproduce the output with some tunable error threshold, which might give you leeway for finding more comprehensible and simpler trees.

I think usually if a problem is more solvable by a neural network than a decision tree, there is an underlying reason. Neural nets and decision trees work in very different ways.

Take image classification as an example. CNNs can do it by finding nonlinear patterns in the pixel data. A decision tree would have a very tough time, because the pixels have a complicated relationship with each other that defines what the image is of.

I think for something like the data transformations we're talking about, a neural network would be pretty overkill. It looks like this feature in Excel works by comparing the data to pre-defined formats, which is probably done by searching all known formats in a somewhat intelligent (not AI, just intelligent) way so that it's fast. Then it can output that type of data in whatever form you want.

Your comment gave me an interesting idea though: What if we put neural networks inside of decision trees?

> This might be a dumb question, but let's say that for whatever reason on a specific problem it's much easier to train a neural network that generalizes well than a decision tree. Why not train the network, then build an equivalent decision tree that just tries to reproduce the network's output? When building the tree from the network, overfitting would not be a concern. In fact, you'd want it to overfit.

You haven't fixed anything here. You've just encoded your training data in a neural net and then presented the same problem to the decision tree learner. Unless you're planning to transform your training data somehow?

I don't think I follow here. The goal of training the network isn't to encode the training data in the model, but rather to build a model that generalizes well. If the neural network has just memorized the training examples, then it overfit and really isn't useful in the real world.

I'm imagining a hypothetical example where generalization is easier to achieve with a neural network than with a decision tree using standard training techniques. Then a tree trained on the network might generalize better than a tree trained straight on the original data, with the additional benefit of being less of a black box than the network.

This solves the problem of interpretability. You can't interpret the weights of a neural network, but you can easily follow along a decision tree and see if it's doing what you want.

Actually that's somewhat less true for big decision trees. But the general point is that you can train interpretable models to mimic the output of uninterpretable black boxes.

The biggest issue is that decision trees only work for data with fixed inputs and outputs. Recurrent NNs work on a time series and possibly even have attention mechanisms.

No, it doesn't make sense. The training data (inputs and NN-predicted outputs) that you're feeding into the DT is at best the same as the training data (inputs and desired outputs) you had originally.

You can generate infinite training data with the NN by feeding in random inputs and seeing what outputs it gives. You can then train whatever model you want on it without concern for overfitting.

But more importantly, the decision tree will model the behavior of the NN, not necessarily the original data. Which is what you want, if your goal is to understand what function the NN has learned.
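A minimal sketch of that idea, assuming scikit-learn (the toy task, libraries, and hyperparameters here are all illustrative, not anything this product does): train a small net, query it on random inputs, then deliberately overfit an unpruned tree to the net's predictions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Toy task: classify points by whether they fall inside the unit circle.
X = rng.uniform(-2, 2, size=(500, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)

nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                   random_state=0).fit(X, y)

# "Infinite training data": sample fresh inputs and take the net's
# predictions as labels -- we are modeling the net, not the original data.
X_probe = rng.uniform(-2, 2, size=(5000, 2))
y_probe = nn.predict(X_probe)

# Unpruned tree, grown until pure, so it reproduces the net's behavior
# (near-)exactly on the probe set -- overfitting is the point here.
tree = DecisionTreeClassifier(random_state=0).fit(X_probe, y_probe)
agreement = (tree.predict(X_probe) == y_probe).mean()
```

Inspecting `tree` then tells you what function the net actually learned, including any of its mistakes.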

The point about infinite training data is potentially useful. The other one I still don't agree with. Your goal is only to understand the NN insofar as it models the original data. Any errors the NN is making are not worth learning about. So it would be better to train the understandable method (DT) on the original data.

>Any errors the NN is making are not worth learning about.

But that's the whole point of this method! To understand what errors the NN might be making. It's also quite possible the NN's errors aren't really errors, if there are mistakes or noise in the labels.

This technique has been called "dark knowledge" and is really interesting. See http://www.kdnuggets.com/2015/05/dark-knowledge-neural-netwo... They train much simpler models to get the same accuracy as much bigger models, just by copying the predictions of the bigger model on the same data. In fact you can get crazy results like this:

>When they omitted all examples of the digit 3 during the transfer training, the distilled net gets 98.6% of the test 3s correct even though 3 is a mythical digit it has never seen.

Ah, very interesting! I agree that would be useful. But I think this thread has ended up with a proposal very different from the one I started replying to.

> When building the tree from the network, overfitting would not be a concern.


> tunable error threshold, which might give you leeway for finding more comprehensible and simpler trees


However, my guess is you'll wind up achieving only one of (a) a more accurate tree model than training it directly, or (b) significantly improved interpretability of your model.

Which is why learning algorithms that produce decision trees are far smarter in the long run. Neural nets might eke out other benefits but there's a lot to be said about justifiable/accountable decisions.

Correct me if I'm wrong, but the only type of decision tree that is comparable to a NN in terms of performance is an ensemble of decision trees, and these are equally hard to interpret as NNs.

Except the big (if not the main) part of modern economics is all about "don't care about long run - that's the only smart strategy".

Except that's not so much a scientific theory but a justification for "quick bucks, everything else be damned".

It's theoretically possible for a neural net to do this; the network just needs to have the explanation as an output. I agree that decision trees would be more reliable and easier to train, but I'm not sure if hardcoding every feature is scalable.

How do you know that the explanation jibes with the other outputs, though? It seems like a turtles-all-the-way-down situation, because now I want to see how it was properly introspective of its own decision making.

Also seems like it’s another magnitude of complexity in the neural net to have it not only train and learn on your inputs, but also train and learn on its own training and learning.

Neural networks don't do anything as sophisticated as self-referential introspection. They just fit the outputs you train them with. The training data you provide would have to include the desired explanations.

Consistency is enforced by the dataset, and also by the model. Both outputs would read from the same hidden layer--the one that encodes the desired transformation.

>How do you know that the explanation jives with the other outputs, though?

The third neural net would do the checking, obviously.

> the network just needs to have the explanation as an output

And how would you evaluate whether the explanation was correct or not?

You give it explanations as training data and it tries to predict them.

> This is why I want any ML device to be able to explain itself.

This is the problem of lacking explanatory mechanisms in ML.

Note that some techniques that are very out of vogue at the moment, such as Genetic Programming, are much better than neural nets in this regard.

IIRC you _can_ get this, but it's a huge algorithm that doesn't do things in a way that would probably make sense to a human. It would be amazing to be able to transform code into human language.

I was sloppy in my own examples. I’d be perfectly satisfied with an AST, or regex, or other non-English explanation. But something that I can audit is what I’m after. Otherwise this tool would never be trustworthy enough to let loose on billions of rows of data, with silent errors occurring throughout. (Well, I guess it depends on the nature and importance of the data. Cat photos, meh. Drug prescriptions? Ahh!)

Maybe also synthesize and suggest property-based tests, by having the user specify some invalid examples too. Then these checks could be run for each transformation. For instance:

- 123 456 = 123456 valid

- 1234567 = 123567 not valid (dropped 4)


- output may not contain whitespace

- no number characters may be dropped

- characters may not be reordered
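A sketch of those checks in plain Python (no property-testing library; random inputs stand in for generated cases, and `transform` is a hypothetical candidate under test):

```python
import random
import string

def transform(s):
    # Hypothetical candidate the system proposed.
    return s.replace(" ", "")

def check_properties(transform, trials=1000, seed=0):
    rnd = random.Random(seed)
    for _ in range(trials):
        s = "".join(rnd.choice(string.digits + " ")
                    for _ in range(rnd.randint(0, 12)))
        out = transform(s)
        # Property: output may not contain whitespace.
        assert " " not in out, f"whitespace left in output for {s!r}"
        # Property: no digit may be dropped or reordered.
        assert [c for c in out if c.isdigit()] == [c for c in s if c.isdigit()], \
            f"digits dropped or reordered for {s!r}"
    return True
```

Running `check_properties` against the 7-character "cheat" rule from upthread would fail immediately on the dropped-digit property.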

Some ML systems (like decision trees) can give you a comprehensible account of how they made the decision (a list of if-conditions). Unfortunately many can't do this (random forests). Whether an AI can explain why it did something in every situation also depends on the underlying technique. For instance, random forests take random samples of the features and create many decision trees, so the explanation wouldn't be much use to you.

Not true. Look up "partial dependence plots".

This is the stated goal of the Explainable AI initiative (spearheaded afaik by DARPA, though Google tells me corporates have also begun work on it). I hope it works out well because there's going to be a lot of AI code in the near future, and the thought of them all being inscrutable black boxes is pretty scary.

But, you know, if you saw something like, all your visible examples were like the strings

  123 456
  345 678
and the program replies with something like what you wrote: “Remove the middle character from any 7-character strings to make them 6 characters in length”, it would actually take a programmer’s mind to be able to envision why this might in some cases be wrong. Most people who are not programmers would, I think, see this as equivalent to “Remove interior spaces from each item”. I suspect that the skill required to choose an algorithm correctly is the exact same skill required to actually be a programmer.

All this then buys you is that you don’t have to remember the function names.

Yes, this system should have an intermediate step of spitting out a checklist of clear rules so the user can select the best fit, saving the human the time it would take to search a bloated dropdown of all possible rules.

MS has been experimenting with this for a while [1]. They even included this in Excel 2013 as "FlashFill". It does not use any NN/ML at all. It uses "program synthesis", which by definition can tell you exactly what "program" it has synthesized to convert your data. In fact, in your example it would not cheat; rather, it would leave the string unchanged, as explained in the paper.

[1] https://www.microsoft.com/en-us/research/publication/automat...

More generally, anything by Sumit Gulwani's group at MSR should be of interest.

But maybe I did want it to remove the middle character! Using my training data, there’s no way for the system to actually know for sure what I meant. There is also no way (in general) for it to detect “outliers” and ask me about them, because there is no good way to know what is an outlier and what is not.

You can modify the output of the program synthesis to fit your needs. See this CurryOn! talk from 2015 by Sumit Gulwani


This is why this kind of software should have an interactive 'feedback' function that allows the user to select among several, equally likely rules.

The experimental Lapis editor[1] did exactly this, by the way.

[1] https://en.wikipedia.org/wiki/Lapis_(text_editor)

This isn't the machine's fault, though; for a small number of linearly independent examples there exists an enormous number of possible functions that match the training data. It has no way of guessing, really.

If the machine had a large background knowledge of what humans would typically like to do, it would help.

Sounds like you're talking about over-fitting the data. Or perhaps just not providing an evaluation function that is sufficiently general. All ML can fall into this trap.

When you only provide the system with a few examples, there are many possible transformations which satisfy the examples.

One way to eliminate this ambiguity is to also provide a natural language description of what you want, e.g. "remove the spaces".

In the natural language processing community, we call this semantic parsing.

But sometimes the semantic parser can misinterpret the language too and generate a program which still "cheats" in the same manner as you described. We call these "spurious programs".

Shameless plug-- my group has been working on how to deal with these spurious programs:

From Language to Programs: Bridging Reinforcement Learning and Maximum Marginal Likelihood https://arxiv.org/abs/1704.07926

You don't realise how big this is. This is the beginning of the automation of coding.

We had such things for decades.

Besides, it depends on the slope of "coding". If it gets really difficult really quick (exponentially say), this could just be forever stuck in the "low hanging fruit" stage.

No, assembly language was the beginning of automation of coding. Almost nobody codes using raw machine code anymore. Everything since is just more added abstractions.

I agree, thank you for provoking this thought. It is raw, and if so I apologize.

This is where hinting is important. Metadata. That sequence if I know it's a phone number, or a sequence of increasing digits, depends a lot on metadata.

Given some reasonable sample size, I believe machine learning could provide hints as to some of the common types of formats. Semi-automated data hinting or structuring?

There is a bidirectional connection between interpreting your data and how your data is structured.

Is it possible to use your data column to statistically hint at metadata characteristics by some sort of clustering, then use that to automatically clean input data?

This is a great product idea. If you ask any Excel power users, by far the most time-consuming and hard-to-automate task is text and date manipulation.

The beauty of this product is that its adoption strategy is baked into the product itself: I'd share this with all Excel user friends of mine because I want the algorithm to get smarter, and I might even learn a bit of C# myself so that I can contribute and scratch my own itch. This in turn makes the product better (because of the larger training data), lending itself to more word of mouth.

One concern I have is security: I'd love to hear from folks who built this/more familiar with this about how to ensure the security of suggested transformations.

I, too, wonder about security. Just as important: performance/scalability. What happens when this runs on 100K rows against a service some guy stood up as a weekend project? Now what happens with 100 people hitting that service?

Either way, this looks very useful. Having spent more than my fair share of time massaging data prior to import, this looks pretty great.

> What happens when this runs on 100K rows against a service some guy stood up as a weekend project? Now what happens with 100 people hitting that service?

Then they complain to Microsoft, who helpfully suggests the product they should upgrade to. This has always been a strong spot of Microsoft's. "I see you've scaled beyond the capacity of [Product A]. Well, fortunately for you we have [Product B] which can handle it, with a nice import wizard to get you started painlessly." It typically goes Excel > Access > On-prem SQL Server > Azure.

This sounds very negative and I swear I don't mean it that way. It's a great sales tactic if you offer products at every level of scale.

SQL Server lite < SQL Server

Slightly different use case but still very useful for data manipulation is the OpenRefine (formerly Google Refine) https://www.youtube.com/watch?v=B70J_H_zAWM

I wonder if it uses Z3 under the hood for solving constraints. Very nice of MSFT to MIT license Z3. It's super useful for problems that result in circular dependencies when modeled in Excel, and require iterative solvers (e.g., goal seek). I use the python bindings, but unfortunately it's not as simple as `pip install` and requires a lengthy build/compilation. Well worth the effort, though.



Love Z3. It is easy to use and very decent performance! I don't think MS is using Z3 on this product though, looks more like the smart enumeration based program synthesis

I think most of the enumeration-based synthesis tools rely on an SMT solver (Z3 or CVC4, say) to learn from bad solutions.

Check out MagicHaskeller which figures out list processing functions from examples: http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.ht...

For example, given the rule `f "abcde" 2 == "aabbccddee"`, it even figures out the role of the parameter `2`, so `f "zq" 3` gives `"zzzqqq"`.
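In Python terms, the function it synthesizes from that single example is equivalent to repeating each character n times:

```python
def f(s, n):
    # Repeat each character of s, n times -- the behavior MagicHaskeller
    # infers from the lone example f "abcde" 2 == "aabbccddee".
    return "".join(c * n for c in s)

print(f("abcde", 2))  # aabbccddee
print(f("zq", 3))     # zzzqqq
```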

Wait, Excel has had this built in since 2013!


I recall a joke from someone... Can't remember who, that went like, "does your startup compete with Excel? Did you just rediscover pivot tables?"

Is this related/a commercial application of the 'Deep Learning for Program Synthesis' post[0][1] from Microsoft Research on HN a month ago?


[1]HN Discussion: https://news.ycombinator.com/item?id=14168027

Oh man, we did it before Microsoft!



I'm playing around with a client-side js implementation of this at https://www.robosheets.com/

It's not production ready / launched yet, but it's getting there.

I'd be interested to hear if anyone finds (or really doesn't find) this useful :)

I wasn't able to figure out how to use the app, but I did want to drop in to say that I like how you've placed the "Upgrade to enable" notices in cells past a limited range.

This was also included in the query editor of Microsoft's Power BI in the release a month or two ago. First you select the columns to be used as a source then start writing example values to the new column to be generated. It also shows the generated M/PowerQuery expression.

It can't do miracles, but this is time saving in many cases like when you want to concatenate values from different columns in a new format into a single column and so on.

See also http://www.transformy.io/#/app

Ok, just realized somehow the site has vanished. Archived version (not working): http://web.archive.org/web/20161028231256/https://www.transf...

Humans are really good at taking a vague description of a task and using a small number of examples to disambiguate it.

For example, "sort all of the folders, so that it Alan goes before Amy, etc". The rule ("sort") is pretty ambiguous, but one simple example in the context gives enough information to realise you probably mean alphabetically by first name.

Is there something like this example that could be combined with NLP to make things like these "intelligent assistants" we have now much more useful for data processing tasks?

It would be great to describe data manipulation to a machine the way that I would describe it to a colleague: give an overview of an algorithm, watch how they interpret it, and correct with a couple of examples in a feedback loop. Currently describing such things for a machine requires writing the algorithm manually in a programming language.

Isn't that just because we've already been trained on that since learning the alphabet? Imagine giving a human the same question, but sorting "aa" before "bb" but after "cc".

Maybe it's "just" because of previous training, but it's still a very useful ability, which programs do not have.

Being able to quickly solve the most common cases (which rely on such "common knowledge") would automate a lot of work that now requires writing a complex program in advance, and would allow the user to concentrate on the outliers that require more thought.

> Humans are really good at taking a vague description of a task and using a small number of examples to disambiguate it.

IIRC this is tested heavily in IQ tests.

Abstract thought really.

It would be nice if it indicated where it was making stuff up (in the zip code example, for the rows that were missing some data, it just makes it up - these rows are not distinguished visually from the rows where it did not add data not in the input.)

What I mean is if every row had a date like "12 May 2002" and you wanted it turned into 2002.05.12 then it would be nice if it indicated when it added data. For example if one of the rows just read "15 May" then, since there is no year, it would not be completely absurd if it transformed into 2017.05.15 - or if all of the other data is 2002, then adding that. But I really think silently adding data that was not in the input is going too far. A transform shouldn't ever silently inject plausible data with no indication that this is interpolated. Bad things can result.

Otherwise great demo!
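A sketch of the kind of flagging I mean (the API here is invented for illustration): the transform returns the normalized date plus a marker whenever it had to interpolate a missing year:

```python
from datetime import datetime

DEFAULT_YEAR = 2002  # e.g. inferred from the other rows

def to_dotted(date_str):
    """Return (normalized date, was_data_added) instead of silently guessing."""
    try:
        d = datetime.strptime(date_str, "%d %B %Y")
        return d.strftime("%Y.%m.%d"), False
    except ValueError:
        # Year missing: interpolate a plausible one, but say so.
        d = datetime.strptime(f"{date_str} {DEFAULT_YEAR}", "%d %B %Y")
        return d.strftime("%Y.%m.%d"), True

print(to_dotted("12 May 2002"))  # ('2002.05.12', False)
print(to_dotted("15 May"))       # ('2002.05.15', True)
```

Downstream code (or the UI) can then highlight every row where the second element is True.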

I believe this is the implementation described in this paper published at POPL 2016:


Though it probably also uses more recent work from the same group:


Excel is a really powerful tool. If you are fine with needing Windows or Mac (e.g. not Linux) and you are ok with their licensing constraints it's pretty hard to beat.

The add-on supports "Excel online".

Relationship to FlashFill feature in Excel: FlashFill is a popular feature in Excel that also uses the example-driven paradigm to automatically produce transformations. While FlashFill supports string-based transformations, Transform Data by Example can leverage sophisticated domain-specific functions to perform semantic transformations beyond string manipulations. For examples, see: https://www.microsoft.com/en-us/research/wp-content/uploads/...

I hacked together something similar that learns row/column offsets for different fields in a text file, and converts it into a normal CSV, i.e. a normal table.


There is a paper describing such a method (not sure if that is what was implemented):

"Zhongjun Jin, Michael R. Anderson, Michael J. Cafarella, H. V. Jagadish: Foofah: Transforming Data By Example. SIGMOD Conference 2017: 683-698"

No, the paper co-incidentally shares a similar name, but the capabilities are different, and there is no relation in terms of people or underlying technology. https://www.microsoft.com/en-us/research/project/transform-d...

Seems similar to http://openrefine.org/

That's great, I always loved Auto Fill in Excel, and this brings it to the Mac.

* Correction. I meant Flash Fill. And this isn't an exact replacement, but it's pretty close.

I would love something similar for Google Spreadsheet.

I want this in Vim :)

This would be great for refactoring code.

alas it is too late, it transformed our genes to dates, no sequence for Bill

not usable for companies and secured networks. :-( too bad

There's a huge opportunity in making Excel better...
