
Transform Data by Example [video] - gggggggg
https://www.microsoft.com/en-us/research/project/transform-data-by-example/
======
teddyh
You know what this reminds me of? Those trained neural-net things which,
however many training examples you give them, always seem to find some way to
“cheat” and not do what you want while still obeying all your training data
correctly.

Something like this: Suppose we have a table of strings of digits, some
including spaces, and we’d like to remove the spaces. From

    
    
      123 456
      234567
      345 678
    

to

    
    
      123456
      234567
      345678
    

Now, what happens if it encounters, say

    
    
      4567890
    

Would the result be unchanged (as we would probably want), or would it “cheat”
and remove the middle “7” character, giving “456890”?
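
To make the ambiguity concrete, here is a minimal sketch of two candidate programs that both reproduce the three example rows yet disagree on the new input. Both function names are hypothetical and chosen only for illustration; nothing here reflects how the actual tool picks a program.

```python
# Two hypothetical candidate programs; both are consistent with the examples.
def remove_spaces(s: str) -> str:
    """Candidate A: delete every space character."""
    return s.replace(" ", "")


def drop_middle_of_seven(s: str) -> str:
    """Candidate B: if the string has exactly 7 characters, delete the 4th."""
    return s[:3] + s[4:] if len(s) == 7 else s


examples = [("123 456", "123456"), ("234567", "234567"), ("345 678", "345678")]

for program in (remove_spaces, drop_middle_of_seven):
    fits = all(program(src) == dst for src, dst in examples)
    print(program.__name__, "| fits examples:", fits, "| 4567890 ->", program("4567890"))
```

Both candidates fit the examples, so only an extra example (or a bias toward simpler programs) can tell them apart.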

~~~
function_seven
This is why I want any ML device to be able to explain itself. It could train
on your before-and-after examples and come up with a list of what it thinks
you want it to do.

For your example, it could list:

    
    
        “Remove interior spaces from each item”
    

or it could say:

    
    
        “Remove the middle character from any 7-character strings to make them 6 characters in length”
    

You would be able to do something with that.

~~~
paulddraper
Neural nets are infamous for not doing this.

Learning algorithms that produce decision trees are usually used in this
situation.

~~~
nilkn
This might be a dumb question, but let's say that for whatever reason on a
specific problem it's much easier to train a neural network that generalizes
well than a decision tree. Why not train the network, then build an equivalent
decision tree that just tries to reproduce the network's output? When building
the tree from the network, overfitting would not be a concern. In fact, you'd
_want_ it to overfit.

You could even say that it only needs to approximately reproduce the output
with some tunable error threshold, which might give you leeway for finding
more comprehensible and simpler trees.

~~~
randomsearch
> This might be a dumb question, but let's say that for whatever reason on a
> specific problem it's much easier to train a neural network that generalizes
> well than a decision tree. Why not train the network, then build an
> equivalent decision tree that just tries to reproduce the network's output?
> When building the tree from the network, overfitting would not be a concern.
> In fact, you'd want it to overfit.

You haven't fixed anything here. You've just encoded your training data in a
neural net and then presented the same problem to the decision tree learner.
Unless you're planning to transform your training data somehow?

~~~
Houshalter
This solves the problem of interpretability. You can't interpret the weights
of a neural network, but you can easily follow along a decision tree and see
if it's doing what you want.

Actually that's somewhat less true for big decision trees. But the general
point is that you can train interpretable models to mimic the output of
uninterpretable black boxes.

The biggest issue is that decision trees only work for data with fixed inputs
and outputs. Recurrent NNs work on a time series and possibly even have
attention mechanisms.

~~~
jmmcd
No, it doesn't make sense. The training data (inputs and NN-predicted outputs)
that you're feeding into the DT is at best the same as the training data
(inputs and desired outputs) you had originally.

~~~
Houshalter
You can generate infinite training data with the NN by feeding in random
inputs and seeing what outputs it gives. You can then train whatever model you
want on it without concern for overfitting.

But more importantly, the decision tree will model the behavior of the NN, not
necessarily the original data. Which is what you want, if your goal is to
understand what function the NN has learned.
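
A minimal sketch of that idea, under loud assumptions: `black_box` is a hypothetical stand-in for a trained network that we can only query, and the "tree" is a single split (a decision stump) fitted by brute force in plain Python. A real experiment would use a real network and a proper tree learner.

```python
import random


def black_box(x):
    """Hypothetical stand-in for a trained network; we may only query it."""
    return 1 if 3.0 * x[0] + 0.2 * x[1] > 1.8 else 0


def fit_stump(inputs, labels):
    """Find the one-split tree (feature, threshold) that best mimics the labels."""
    best = (0.0, 0, 0.0)  # (fidelity, feature index, threshold)
    for feature in range(len(inputs[0])):
        for step in range(1, 50):
            threshold = step / 50
            agree = sum((x[feature] > threshold) == bool(y)
                        for x, y in zip(inputs, labels))
            best = max(best, (agree / len(inputs), feature, threshold))
    return best


random.seed(0)
# Query the black box on random inputs: as much labelled data as we like.
inputs = [(random.random(), random.random()) for _ in range(2000)]
labels = [black_box(x) for x in inputs]

fidelity, feature, threshold = fit_stump(inputs, labels)
print(f"if x[{feature}] > {threshold:.2f}: predict 1, else 0 "
      f"(agrees with the black box on {fidelity:.1%} of queries)")
```

The printed rule describes what the black box does, not what the original labels say, which is exactly the distinction being argued about here.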

~~~
jmmcd
The point about infinite training data is potentially useful. The other one I
still don't agree with. Your goal is only to understand the NN insofar as it
models the original data. Any errors the NN is making are not worth learning
about. So it would be better to train the understandable method (DT) on the
original data.

~~~
Houshalter
>Any errors the NN is making are not worth learning about.

But that's the whole point of this method! To understand what errors the NN
might be making. It's also quite possible the NN's errors aren't really
errors, if there are mistakes or noise in the labels.

This technique has been called "dark knowledge" and is really interesting. See
[http://www.kdnuggets.com/2015/05/dark-knowledge-neural-
netwo...](http://www.kdnuggets.com/2015/05/dark-knowledge-neural-network.html)
They train much simpler models to get the same accuracy as much bigger models,
just by copying the predictions of the bigger model _on the same data_. In
fact you can get crazy results like this:

>When they omitted all examples of the digit 3 during the transfer training,
the distilled net gets 98.6% of the test 3s correct even though 3 is a
mythical digit it has never seen.

~~~
jmmcd
Ah, very interesting! I agree that would be useful. But I think this thread
has ended up with a proposal very different from the one I started replying
to.

------
ktamura
This is a great product idea. If you ask any Excel power users, by far the
most time-consuming and hard-to-automate task is text and date manipulation.

The beauty of this product is that its adoption strategy is baked into the
product itself: I'd share this with all Excel user friends of mine because I
want the algorithm to get smarter, and I might even learn a bit of C# myself
so that I can contribute and scratch my own itch. This in turn makes the
product better (because of the larger training data), lending itself to more
word of mouth.

One concern I have is security: I'd love to hear from folks who built this, or
who are more familiar with it, about how to ensure the security of suggested
transformations.

~~~
haswell
I, too, wonder about security. Just as important: performance/scalability.
What happens when this runs on 100K rows against a service some guy stood up
as a weekend project? Now what happens with 100 people hitting that service?

Either way, this looks very useful. Having spent more than my fair share of
time massaging data prior to import, this looks pretty great.

~~~
Analemma_
> What happens when this runs on 100K rows against a service some guy stood up
> as a weekend project? Now what happens with 100 people hitting that service?

Then they complain to Microsoft, who helpfully suggests the product they
should upgrade to. This has always been a strong spot of Microsoft's. "I see
you've scaled beyond the capacity of [Product A]. Well, fortunately for you we
have [Product B] which can handle it, with a nice import wizard to get you
started painlessly." It typically goes Excel > Access > On-prem SQL Server >
Azure.

This sounds very negative and I swear I don't mean it that way. It's a great
sales tactic if you offer products at every level of scale.

~~~
paulddraper
SQL Server lite < SQL Server

------
Cieplak
I wonder if it uses Z3 under the hood for solving constraints. Very nice of
MSFT to MIT license Z3. It's super useful for problems that result in circular
dependencies when modeled in Excel, and require iterative solvers (e.g., goal
seek). I use the python bindings, but unfortunately it's not as simple as `pip
install` and requires a lengthy build/compilation. Well worth the effort,
though.

[https://github.com/Z3Prover/z3](https://github.com/Z3Prover/z3)

[https://github.com/Z3Prover/z3/issues/288](https://github.com/Z3Prover/z3/issues/288)
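
For contrast with a constraint solver, here is a minimal sketch of the goal-seek style of iterative solving mentioned above. It uses plain bisection, which is an assumption for illustration and not a claim about what Excel or Z3 does internally; the function name and the interest-rate sheet are hypothetical.

```python
def goal_seek(formula, target, low, high, tolerance=1e-9):
    """Find x in [low, high] with formula(x) == target.

    Assumes formula is continuous and that formula(x) - target changes
    sign somewhere in the interval (i.e. the target is bracketed).
    """
    f_low = formula(low) - target
    if f_low * (formula(high) - target) > 0:
        raise ValueError("target is not bracketed by [low, high]")
    while high - low > tolerance:
        mid = (low + high) / 2
        f_mid = formula(mid) - target
        if f_low * f_mid <= 0:
            high = mid              # the sign change lies in the lower half
        else:
            low, f_low = mid, f_mid  # the sign change lies in the upper half
    return (low + high) / 2


# Hypothetical sheet: which yearly rate turns 1000 into 1500 after 5 years?
rate = goal_seek(lambda r: 1000 * (1 + r) ** 5, target=1500, low=0.0, high=1.0)
print(round(rate, 6))
```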

~~~
tonmoy
Love Z3. It is easy to use and has very decent performance! I don't think MS
is using Z3 in this product though; it looks more like smart enumeration-based
program synthesis.
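
A rough sketch of what enumeration-based synthesis means, assuming a deliberately tiny, made-up DSL of string operations. This is plain brute force with no pruning or ranking, so it is not the "smart" variant and says nothing about how the Microsoft tool works.

```python
from itertools import product

# A deliberately tiny, hypothetical DSL of string operations.
PRIMITIVES = {
    "strip": str.strip,
    "upper": str.upper,
    "lower": str.lower,
    "remove_spaces": lambda s: s.replace(" ", ""),
    "first_word": lambda s: s.split(" ")[0],
    "reverse": lambda s: s[::-1],
}


def run(pipeline, text):
    """Apply the named primitives to text, left to right."""
    for name in pipeline:
        text = PRIMITIVES[name](text)
    return text


def synthesize(examples, max_length=3):
    """Return the first shortest pipeline consistent with every example pair."""
    for length in range(1, max_length + 1):
        for pipeline in product(PRIMITIVES, repeat=length):
            if all(run(pipeline, src) == dst for src, dst in examples):
                return pipeline
    return None


program = synthesize([("john smith", "JOHN"), ("ada lovelace", "ADA")])
print(program, "->", run(program, "grace hopper"))
```

The search space grows exponentially with pipeline length, which is why practical tools need pruning and ranking on top of this.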

~~~
c-cube
I think most of the enumeration-based synthesis tools rely on an SMT solver (Z3
or CVC4, say) to learn from bad solutions.

------
gergoerdi
Check out MagicHaskeller which figures out list processing functions from
examples:
[http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.ht...](http://nautilus.cs.miyazaki-u.ac.jp/~skata/MagicHaskeller.html)

For example, given the rule `f "abcde" 2 == "aabbccddee"`, it even figures out
the role of the parameter `2`, so `f "zq" 3` gives `"zzzqqq"`.
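
For readers who don't want to open the demo, this is one function consistent with that rule, written in Python rather than Haskell. It illustrates the target behaviour only and is not MagicHaskeller's actual output.

```python
def f(text: str, n: int) -> str:
    """Repeat each character of text n times, keeping the original order."""
    return "".join(ch * n for ch in text)


assert f("abcde", 2) == "aabbccddee"  # the example given as the rule
print(f("zq", 3))  # prints zzzqqq
```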

------
bcherny
Wait, Excel had this built in since 2013!

[https://support.office.com/en-us/article/Use-AutoFill-and-
Fl...](https://support.office.com/en-us/article/Use-AutoFill-and-Flash-
Fill-2e79a709-c814-4b27-8bc2-c4dc84d49464)

~~~
taeric
I recall a joke from someone... Can't remember who, that went like, "does your
startup compete with Excel? Did you just rediscover pivot tables?"

------
netvarun
Is this related/a commercial application of the 'Deep Learning for Program
Synthesis' post[0][1] from Microsoft Research on HN a month ago?

[0][https://www.microsoft.com/en-us/research/blog/deep-
learning-...](https://www.microsoft.com/en-us/research/blog/deep-learning-
program-synthesis/)

[1]HN Discussion:
[https://news.ycombinator.com/item?id=14168027](https://news.ycombinator.com/item?id=14168027)

------
martinthenext
Oh man, we did it before Microsoft!

[http://comnsense.io/](http://comnsense.io/)

[https://youtu.be/ALF9GY2K-wc](https://youtu.be/ALF9GY2K-wc)

------
wayneprice
I'm playing around with a client-side js implementation of this at
[https://www.robosheets.com/](https://www.robosheets.com/)

It's not production ready / launched yet, but it's getting there.

I'd be interested to hear who finds (or really doesn't find) this useful :)

~~~
eob
I wasn't able to figure out how to use the app, but I did want to drop in to
say that I like how you've placed the "Upgrade to enable" notices in cells
past a limited range.

------
gerhardi
This was also included in the query editor of Microsoft's Power BI in the
release a month or two ago. First you select the columns to be used as a
source then start writing example values to the new column to be generated. It
also shows the generated M/PowerQuery expression.

It can't do miracles, but this is time saving in many cases like when you want
to concatenate values from different columns in a new format into a single
column and so on.

------
fiatjaf
See also [http://www.transformy.io/#/app](http://www.transformy.io/#/app)

Ok, just realized somehow the site has vanished. Not working archived version:
[http://web.archive.org/web/20161028231256/https://www.transf...](http://web.archive.org/web/20161028231256/https://www.transformy.io/#/)

~~~
awake
[https://apps.synesty.com/transformy/en-
us](https://apps.synesty.com/transformy/en-us) this might be it

------
unfamiliar
Humans are really good at taking a vague description of a task and using a
small number of examples to disambiguate it.

For example, "sort all of the folders, so that Alan goes before Amy, etc".
The rule ("sort") is pretty ambiguous, but one simple example in the context
gives enough information to realise you probably mean alphabetically by first
name.

Is there something like this example that could be combined with NLP to make
things like these "intelligent assistants" we have now much more useful for
data processing tasks?

It would be great to describe data manipulation to a machine the way that I
would describe it to a colleague: give an overview of an algorithm, watch how
they interpret it, and correct with a couple of examples in a feedback loop.
Currently describing such things for a machine requires writing the algorithm
manually in a programming language.

~~~
Spiritus
Isn't that just because we've already been trained on that since learning the
alphabet? Imagine giving a human the same question, but sort "aa" before "bb"
but after "cc".

~~~
TuringTest
Maybe it's "just" because of previous training, but it's still a very useful
ability, which programs do not have.

Being able to solve quickly the most common cases (which rely on such "common
knowledge") would automate a lot of work that now requires writing a complex
program in advance, and would allow the user to concentrate on the outliers
that require more thought.

------
logicallee
It would be nice if it indicated where it was making stuff up (in the zip code
example, for the rows that were missing some data, it just makes it up - these
rows are not visually distinguished from the rows where it added nothing
beyond the input).

What I mean is if every row had a date like "12 May 2002" and you wanted it
turned into 2002.05.12 then it would be nice if it indicated when it added
data. For example if one of the rows just read "15 May" then, since there is
no year, it would not be completely absurd if it transformed into 2017.05.15 -
or if all of the other data is 2002, then adding that. But I really think
_silently_ adding data that was not in the input is going too far. A transform
shouldn't ever silently inject plausible data with no indication that this is
interpolated. Bad things can result.
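
A minimal sketch of a transform that reports what it filled in, assuming the input format from the example above and an English month-name locale. The function name and the `default_year` parameter are hypothetical.

```python
from datetime import datetime


def to_dotted_date(text, default_year=None):
    """Return (formatted, inferred).

    `inferred` lists the fields that were missing from the input and had
    to be filled in, so a caller can highlight those rows.
    """
    try:
        parsed = datetime.strptime(text, "%d %B %Y")
        return parsed.strftime("%Y.%m.%d"), []
    except ValueError:
        pass  # no year in the input; fall through to the default
    if default_year is None:
        raise ValueError(f"no year in {text!r} and no default supplied")
    parsed = datetime.strptime(f"{text} {default_year}", "%d %B %Y")
    return parsed.strftime("%Y.%m.%d"), ["year"]


for row in ["12 May 2002", "15 May"]:
    value, inferred = to_dotted_date(row, default_year=2002)
    note = f"  <- inferred: {', '.join(inferred)}" if inferred else ""
    print(value + note)
```

Returning the list of inferred fields alongside the value leaves the decision about how to flag the row (colour, comment, separate column) to the caller.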

Otherwise great demo!

------
mballantyne
I believe this is the implementation described in this paper published at POPL
2016:

[https://www.microsoft.com/en-
us/research/publication/transfo...](https://www.microsoft.com/en-
us/research/publication/transforming-spreadsheet-data-types-using-examples/)

Though it probably also uses more recent work from the same group:

[https://www.microsoft.com/en-
us/research/people/sumitg/](https://www.microsoft.com/en-
us/research/people/sumitg/)

------
gshulegaard
Excel is a really powerful tool. If you are fine with needing Windows or Mac
(i.e. not Linux) and you are OK with their licensing constraints, it's pretty
hard to beat.

~~~
cinch
the addon supports "excel online"

------
tdbeteam
Relationship to FlashFill feature in Excel: FlashFill is a popular feature in
Excel that also uses the example-driven paradigm to automatically produce
transformations. While FlashFill supports string-based transformations,
Transform Data by Example can leverage sophisticated domain-specific functions
to perform semantic transformations beyond string manipulations. For examples,
see: [https://www.microsoft.com/en-us/research/wp-
content/uploads/...](https://www.microsoft.com/en-us/research/wp-
content/uploads/2017/02/Sample_Reduced-1.xlsx)

------
JoelJacobson
I hacked together something similar that learns row/column offsets for
different fields in a text file, and converts it into a normal CSV, i.e. a
normal table.

[https://github.com/trustly/fixed2csv](https://github.com/trustly/fixed2csv)
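
As an illustration of the general idea only (this is not the fixed2csv implementation), here is a small sketch that infers column boundaries from a fixed-width text. It assumes that a character column which is blank in every line separates two fields; the function names and the sample data are hypothetical.

```python
import csv
import io


def infer_fields(lines):
    """Return (start, end) spans of character columns used by at least one line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    occupied = [any(line[i] != " " for line in padded) for i in range(width)]
    spans, start = [], None
    for i, used in enumerate(occupied + [False]):  # sentinel closes the last span
        if used and start is None:
            start = i
        elif not used and start is not None:
            spans.append((start, i))
            start = None
    return spans


def fixed_to_csv(text):
    """Convert fixed-width text to CSV using the inferred spans."""
    lines = [line for line in text.splitlines() if line.strip()]
    spans = infer_fields(lines)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        writer.writerow(line[a:b].strip() for a, b in spans)
    return out.getvalue()


sample = """\
id   name      amount
1    Alice      12.50
22   Bob         3.00
"""
print(fixed_to_csv(sample), end="")
```

The heuristic breaks when every row happens to have a space at the same position inside a field, which is where learning offsets from examples would help.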

------
matt4711
There is a paper describing such a method (not sure if that is what was
implemented):

"Zhongjun Jin, Michael R. Anderson, Michael J. Cafarella, H. V. Jagadish:
Foofah: Transforming Data By Example. SIGMOD Conference 2017: 683-698"

~~~
tdbeteam
No, the paper co-incidentally shares a similar name, but the capabilities are
different, and there is no relation in terms of people or underlying
technology. [https://www.microsoft.com/en-
us/research/project/transform-d...](https://www.microsoft.com/en-
us/research/project/transform-data-by-example/)

------
captnswing
Seems similar to [http://openrefine.org/](http://openrefine.org/)

------
copperx
That's great, I always loved Auto Fill in Excel, and this brings it to the
Mac.

~~~
copperx
* Correction. I meant Flash Fill. And this isn't an exact replacement, but it's pretty close.

------
Kiro
I would love something similar for Google Spreadsheet.

------
amelius
I want this in Vim :)

This would be great for refactoring code.

------
tejtm
alas it is too late, it transformed our genes to dates, no sequence for Bill

------
cblte
not usable for companies and secured networks. :-( too bad

------
sjg007
There's a huge opportunity in making excel better..

