Feature requests: allow for more than one example.
Input:
{"class": "101", "students": 101}
{"class": "201", "students": 80}
{"class": "202", "students": 50}
{"class": "301", "students": 120}
Example:
Class 101 has 101 students
Output:
Class 101 has 101 students
Class 201 has 201 students
Class 202 has 202 students
Class 301 has 301 students
Right now the first line cannot have any ambiguity. This is fixable by reordering, but with large enough data sets I may have some ambiguity in all lines, at different places. Multiple examples would fix that.
Again, loved the tool. I can see this going very far, especially with non-technical people.
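The ambiguity problem described above can be made concrete. Here's a hedged sketch (not Transformy's actual algorithm) of why extra examples help: each example constrains which field a token could have come from, and intersecting the candidates across examples shrinks the ambiguity.

```python
def candidate_positions(fields, target):
    """All field indices whose value could have produced `target`."""
    return {i for i, f in enumerate(fields) if f == target}

def infer_mapping(examples):
    """Intersect candidate source fields across (input_fields, output_token)
    example pairs; more examples leave fewer surviving candidates."""
    surviving = None
    for fields, token in examples:
        cands = candidate_positions(fields, token)
        surviving = cands if surviving is None else surviving & cands
    return surviving

# One example: class "101" with 101 students -- "101" is ambiguous.
one = infer_mapping([(["101", "101"], "101")])
print(one)  # -> {0, 1}

# A second example pins it down to the class field (index 0).
two = infer_mapping([(["101", "101"], "101"),
                     (["201", "80"], "201")])
print(two)  # -> {0}
```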
For that use case you can use the Lapis[1][2] desktop app (the secret weapon I use for data munging), which allows you to choose several examples and edit a file with direct manipulation - or define patterns using a DSL.
You could just add a "dummy" input as the first line with all unique entries. Then just remove it from the output. Am I missing something? It's a tiny bit of additional manual processing, but doesn't seem unreasonable.
Right, but if you have a thousand items, with a dozen fields each, you can't be sure the one example you've picked will resolve all ambiguities. But if you could supply four or five example lines, the chances of ambiguity drop off.
I just attended a talk by Sumit Gulwani where he demoed FlashFill and FlashExtract and walked through their underlying architecture. His talk had much cooler demos than any I could find online, but from what I can tell they provide a superset of transformy's functionality (not to detract from it; this is very cool, and I'd be curious to see how related the underlying theories are). Apparently current work by that team at Microsoft is focused on abstracting the functionality out into a system called FlashMeta so that it can be applied to a bunch of domain-specific problems. Overall very exciting work, from both parties.
Nice, Excel really is one of the most undervalued pieces of software; it has so many details like this, and it handles almost anything you throw at it. Pivot tables are another of the lesser-known features that are hard to live without once you've found out they exist.
Given that Excel runs many, if not most financial organizations at a practical level, I'm not sure that it's undervalued in practice.
Excel is to finance people as terminals are to developers. When getting bug reports from end users, we've even had them delivered to us as screenshots embedded in Excel files.
Afaik, "LeBron" is his first name. The way you use it (and misspell it) is confusing, I'd suggest you replace him with a simpler alternative (Kobe Bryant? Tim Duncan? whatever).
Same for Kagawa -- the Japanese use surnames in a different way, might be simpler to replace him.
Sketching means to provide a partial specification, of which details are filled in by the system. But in this case, the user is providing a full description of a single concrete example. These are different concepts.
Thanks for the clarification, they are similar but not the same. It looks like _sketching_ is more analogous to using Hindley-Milner for filling in gaps in an executable spec, whereas Programming by Example infers code from data (examples). There is a wealth of interesting material referenced in http://web.media.mit.edu/~lieber/Your-Wish-Intro.html
Sketching is a superset of programming by example. With examples, you only get input/output pairs, which, as this thread shows, can infer frustratingly close-but-wrong programs.
In sketching, the input and output can also include partial programs (sketches), so you can mark what you like, tweak, etc., and it fills in the rest.
To increase usability, the fragments can be in different languages. For example, the input can be simple C and some test data, and the output can be CUDA GPU code or pthread mutex locking schemes.
For Graphistry.com, we believe in these techniques for ETL and sketching visualizations.
(Also, sketching normally uses machine learning or SAT/SMT solvers: types are more typically used for input hints.)
Cool. I have been mulling over what it would be like to have a semigraphical tool to generate SQL queries (Spark, Hive, VoltDB, etc.) where the user would drive a keyboard, applying operators over columns, rows and relations for mapping, filtering, joining. Like VIM plus ZPL, where the goal is a statistical result over some range. Why can't prolog programs generate our programs? One of my query visualizations is an origami-like structure where data is joined across a relation, something like an Explain Plan done Hollywood-style.
By analogous, I meant more generally, using some partial knowledge about a program or spec to fill in missing pieces, not types specifically.
Nice to know folks are using these techniques to do real world tasks, I always thought something like this would be first used for cleaning data. Types, properties, sketches and examples.
On your main screen, make the example editable. It would be nice to be able to just enter into the green box to see how it works rather than have to click through the "Get Started"
Also, your instructions make it seem like the example is editable:
SUPER EASY TO USE
1. Paste your source data in the white box on the left.
2. Type in the green box on the right how you would like the first line of your data to look.
3. Transformy will look at your example and transform every line from your source data into the same format.
A few feature requests: allow downloading the output as a text file; show a pseudo-code formula of how transformy interpreted the transformation, like "s/.+:\/\/(.+?)\/(.*)/\1, \2/"; and add support for common arbitrary transformations like "November"↔"NOV"↔"11", or "2"↔"2nd".
I think it's trying to be too magical. At this point it either seems to work, or something triggers its pattern matching wrong and it's really hard to figure out what or why. I think giving up a little of the simplicity in favor of more control is worthwhile. For example, if the formatting portions of the example were differentiated from the data-matching portions, it wouldn't be much more complicated, but the intent would be much clearer.
For example, if the rules were: example content must be contained within braces, and any braces within the example content need to be escaped, it's clear. At that point, your example becomes:
{example1.org}, {path/index.html}
It would still probably just return "wwwexample.4, g/hijklmnop" for the last example though, because it's ambiguous whether you want just the end of the url or the whole thing. Allowing regex markup for more explicit matching would make it clearer still, but your example still causes problems until you go all the way to positive lookbehind assertions. At that point, if I need to learn all that, I might as well just use perl.
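The braces rule proposed above is easy to sketch. Here's a hedged illustration (my own toy version, not anything the tool actually does) that turns a braced example into a regex: text inside `{...}` is data to match, everything outside is literal formatting.

```python
import re

def template_to_regex(example):
    """Convert a braced example like "{a}, {b}" into a regex.
    Braces appearing inside the data itself would need escaping,
    which this sketch does not handle."""
    out = []
    for literal, data in re.findall(r"([^{}]*)(?:\{([^}]*)\})?", example):
        out.append(re.escape(literal))  # literal formatting text
        if data:
            out.append("(.*)")          # a data slot to be matched
    return "^" + "".join(out) + "$"

rx = template_to_regex("{example1.org}, {path/index.html}")
m = re.match(rx, "example1.org, path/index.html")
print(m.groups())  # -> ('example1.org', 'path/index.html')
```

With the slots marked explicitly, there is no guessing about which parts of the example were data and which were formatting.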
Amongst other things, this can be used for cleaning special characters out of tables/lists, changing date formats, and creating XML or JSON.
Feedback and suggestions are very much welcome! We plan on adding a few more features soon, as right now it is fairly basic, but we would like to hear some opinions and see if there are people out there who have a use for this.
This is really neat! Any chance this will be a cli tool or module/library?
It doesn't seem to play well with something like this as an input:
3, Roberto/Carlos, soccer, Brazil
35, Roberto/Carlos Michael Jordan, baseball, USA
6, Roberto/Carlos James Lebron, basketball, USA
10, Roberto/Carlos Shinji Kagawa, soccer, Japan
Format:
3, ROBERTO/CARLOS, soccer, Brazil
Gives me:
3, ROBERTO/CARLOS, soccer, Brazil
35, ROBERTO/CARLOS, Michael, Jordan
6, ROBERTO/CARLOS, James, Lebron
10, ROBERTO/CARLOS, Shinji, Kagawa
I can't seem to find a way to get it to parse that out properly (playing with the ROBERTO/CARLOS part).
I even tried this as an input:
3, Roberto Carlos, soccer, Brazil
35, Roberto Carlos Michael Jordan, baseball, USA
6, Roberto Carlos James Lebron, basketball, USA
10, Roberto Carlos Shinji Kagawa, soccer, Japan
Format:
3, ROBERTO CARLOS, soccer, Brazil
Gives me:
3, ROBERTO CARLOS, soccer, Brazil
35, ROBERTO CARLOS, Michael, Jordan
6, ROBERTO CARLOS, James, Lebron
10, ROBERTO CARLOS, Shinji, Kagawa
I formatted their examples to look like some real data I have (obviously not names, but descriptions of some projects). I was curious how this would handle it.
In any case, get rid of the "/" and it's closer to real. Some people have more than two names in their full name. And on a slightly larger set you could very well have something close to my second example.
Currently it matches word by word, so for example if someone has a family name in two parts like "Van Buyten", it won't work. I think it's the same problem in your example: the first "column" contains multiple words in some cases. We'll be fixing this in a future release!
I thought that a cli version might be useful too. The closest thing I have right now is sed/awk. Sed can do this kind of stuff but you have to specify a Regular Expression instead of a simple example. Because you have to be more specific about what you want, Sed will definitely handle those examples, with the caveat that you have to tell Sed what it is that you want to substitute and where for each line.
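To illustrate the explicit-regex route described above, here's a hedged sketch in Python's `re` (the same idea as a sed substitution) applied to the Roberto/Carlos data from earlier in the thread. Because the pattern names "everything up to the next comma" as the second field, multi-word names are no longer ambiguous, at the cost of writing the pattern yourself.

```python
import re

# Uppercase the second comma-separated field, keep the rest as-is.
pattern = re.compile(r"^(\d+), ([^,]+), (.*)$")

def transform(line):
    m = pattern.match(line)
    return f"{m.group(1)}, {m.group(2).upper()}, {m.group(3)}" if m else line

print(transform("6, Roberto/Carlos James Lebron, basketball, USA"))
# -> 6, ROBERTO/CARLOS JAMES LEBRON, basketball, USA
```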
It took me about a year of use before I could figure out how to munge lines in it, so it's definitely not for the faint of heart. I use it for things like transforming excel spreadsheets into C struct arrays.
I was curious and sketched up something similar to this website in about 100 lines of Python. It has a CLI interface, have a look if you're interested:
Good concept, but it doesn't work. For example, type in different variations of legal, well-formatted addresses:
1 Microsoft Way Apt 43, Redmond, WA 98065, U.S.A.
1-1/4 Palm Hwy, Colino, MA 87009, USA
500 Potasium Cloride, Sunshite-Big Blow City, PA 30000, United States of America
First line output should look like:
1 Redmond 98065 U.S.A.
Also, having country-specific obscure sports terminology in the landing page example can cause a lot of confusion.
Right, marking areas (in this case, terms between commas), the way Google's Webmaster Tools structured-data designer does it, would help.
This would require some kind of regular expressions in the example, at least. Of course this would make matters more complicated. This tool excels at simple data sets, I don't think it is meant to be universal.
Buggy: all I did was add the middle name "Dean" to the third line.
In:
3, Roberto Carlos, soccer, Brazil
35, Michael Jordan, baseball, USA
6, James Dean Lebron, basketball, USA
10, Shinji Kagawa, soccer, Japan
Ex: Carlos is number 3 playing soccer
Out:
Carlos is number 3 playing soccer
Jordan is number 35 playing baseball
Dean is number 6 playing Lebron (what??)
Kagawa is number 10 playing soccer
I guess you can't really solve the ambiguity of Carlos meaning the second word of the second column versus the last word of the second column; but the commas should at least hint at a tabular pattern, no?
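The comma hint suggested above can be sketched in a few lines. This is a hedged illustration (my own toy version, not the tool's implementation): split each row into comma-separated columns first, locate the example token inside a column, and remember its position from the end of that column so longer names don't break it.

```python
def find_token(example_token, row):
    """Return (column, word-position-from-end) of the example token,
    or None if it isn't found."""
    for col, cell in enumerate(row.split(", ")):
        words = cell.split()
        for pos, w in enumerate(words):
            if w == example_token:
                # Negative index: position from the end survives
                # name-length changes between rows.
                return col, pos - len(words)
    return None

row0 = "3, Roberto Carlos, soccer, Brazil"
col, pos = find_token("Carlos", row0)      # column 1, last word

row2 = "6, James Dean Lebron, basketball, USA"
print(row2.split(", ")[col].split()[pos])  # -> Lebron
```

Under the "last word of the column" reading, the three-word name now resolves correctly.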
Not sarcastic... I didn't know about FlashFill. I looked at the videos, and it seems that it's not as powerful as this tool. In excel the results would be a custom cell format combined with custom text functions.
Not sure about the implementation in Excel. At least the paper (POPL 2011) describes more powerful functions, containing e.g. case distinctions and simple loops. There are some examples in there. I'm not an Excel expert, but would be surprised if cell format could do that.
I could have used this hundreds of times in Visual Studio.
We often process messages with hundreds of fields. So I will have a class from specs (usually Excel) with a number of properties, then I need a method to populate each of those from some other object, then I need a method that does the reverse and populates some other object from the properties.
This happens over and over for us. Typically I just use a Notepad++ macro. But I could probably use this as it stands, just having it in visual studio would be really incredible.
You know, for people that don't know Awk or aren't comfortable with a scripting language, this is a really nice idea. Thinking back to grad school, which was in a non-computer science quantitative field, there are lots of people that would have appreciated having something like this easily available.
I have a small piece of feedback on the site. You could make it a tiny bit clearer that this is a free, registration-free service which people can start using with just one click.
When I first visited the site, I looked it over, noticed the email box and the "get started" and just assumed it was a library I'd need to buy. It wasn't until I came back to the comments here that I realised the site was a service (which is actually extremely useful to me, and it has been bookmarked).
On this same subject, I would just make the boxes on the very first page editable so you can just play with it right away. After reading the description at the bottom I spent a few seconds trying to manipulate the text boxes on the front page before I realized I had to click a button. It would be really cool if they were live editable examples right on the front page.
This is a very cool tool. I wouldn't trust it with any sensitive info though. The lack of terms, https, and the fact that it's closed source means I have no idea of what could happen to the data I put in there.
If you're translating into text that's meant to be readable, it seems like you need to add a few items to your dataset that give additional information on natural language.
For example, I added some information in the example below about which pronoun to use based on gender. Would be really neat to have this sort of information built into the tool.
Example:
James is 30 years old and his favorite hobby is running
Output:
James is 30 years old and his favorite hobby is running
Erin is 28 years old and her favorite hobby is cooking
Owen is 3 years old and his favorite hobby is playing chase
Luke is 1 years old and his favorite hobby is reading
Shouldn't be hard to update it to allow a second example row to be given so as to disambiguate. Alternatively, just expect the user to give a more perspicuous example of what they want. (I like that word, sorry.)
One can achieve the same in Sublime Text using the multiple-cursors-and-edit feature. This is great for non-tech people.
For those of you who are wondering what Sublime Text can do, do give the Sublime Text video series on Tuts+ a visit; it's awesome and teaches you the power of Sublime Text.
Input:
Bogdan, "Yucca"
Josy, "Orange County"
Bill, "San Diego"
Example:
Bogdan lives in Yucca
Output:
Bogdan lives in Yucca
Josy lives in Orange
Bill lives in San
This is neat. I find myself wanting more detail on what works, though. For example, I c/ped your original example and tried "Roberto C. from Brazil."
It didn't infer that C. meant to truncate the last name, so everything ended up "John C." No biggie, but trying to figure out what does and doesn't work aside from tokenized string formatting was a bummer. Having the uppercase example led me to believe it could do more types of transformation.
Possibly the right answer is hinting of some kind: "Roberto {C.} from Brazil" to hint that the C. should be matched with -something-, and since . naturally means abbreviation it would mean "starts with C".
For more, I gave a feel for how to rethink the full data pipeline using these ideas @ Strange Loop: http://www.infoq.com/presentations/dsl-visualization . It draws on several projects from program synthesis @ Berkeley. (These directly led to applications mentioned here, like FlashFill.)
I'd usually use 'perl -n -e' if I needed something like this. Not suggesting that perl is a better alternative but it's a reason why I'd never need to use that tool.
Here's the corresponding perl program using the same data and output as on the transformy home page:
pbpaste | perl -p -e 's/(\d+), (\w+) (\w+), (\w+), (\w+)/@{[uc($3)]}, jersey number $1/'
To use that (pbpaste, I think, is a Mac-only feature), first copy/paste that into your terminal, then copy the list, then hit enter in the terminal.
This is pretty neat. Since we're asking for features: smarter date conversions. For instance, on the input '2015-04-24' with the example output '24 April 2015', if another line has '2015-03-01' it would output '1 March 2015'. This seems like a somewhat difficult problem, but it'd be magical if it worked.
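A minimal sketch of how this date feature could work, assuming a fixed list of candidate formats (my own illustration, not anything Transformy does): infer a `strptime` format for the example input and the example output, then convert other lines through the pair.

```python
from datetime import datetime

# An illustrative subset of candidate format strings.
CANDIDATES = ["%Y-%m-%d", "%d %B %Y", "%d/%m/%Y", "%B %d, %Y"]

def infer_format(sample):
    """Return the first candidate format that parses `sample`."""
    for fmt in CANDIDATES:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            pass
    return None

def make_converter(example_in, example_out):
    """Build a date converter from a single input/output example pair."""
    fin, fout = infer_format(example_in), infer_format(example_out)
    if fin is None or fout is None:
        raise ValueError("could not infer date formats")
    return lambda s: datetime.strptime(s, fin).strftime(fout)

convert = make_converter("2015-04-24", "24 April 2015")
# Note: %d zero-pads, so this gives "01 March 2015" rather than
# "1 March 2015"; unpadded days need extra handling.
print(convert("2015-03-01"))  # -> 01 March 2015
```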
I filter underscores to dashes, because I want a C function like foo_bar to look like foo-bar in the Lisp dialect. The (void) argument list is handled as a special case.
I need to parse the arguments because the output part needs to know how many there are. Note how "func_n2" is generated, where the 2 comes from the argument count.
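A rough Python sketch of that idea, with the caveat that the exact output shape here is my assumption (only the underscore-to-dash renaming, the `(void)` special case, and the `func_nN` argument-count naming come from the description above):

```python
import re

def c_proto_to_lisp(line):
    """Turn a C prototype into a Lisp-style binding form.
    The concrete output format is a hypothetical stand-in."""
    m = re.match(r"\s*\w[\w\s\*]*?(\w+)\s*\(([^)]*)\)", line)
    if not m:
        return None
    name, args = m.group(1), m.group(2).strip()
    # (void) means zero arguments -- handled as a special case.
    n = 0 if args in ("", "void") else len(args.split(","))
    lisp_name = name.replace("_", "-")  # foo_bar -> foo-bar
    return f"(func_n{n} {lisp_name})"

print(c_proto_to_lisp("int foo_bar(int a, char b);"))  # -> (func_n2 foo-bar)
print(c_proto_to_lisp("void baz(void);"))              # -> (func_n0 baz)
```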
The last example with JSON replace is really nice; I do this regularly with regular-expression find-replace with groups, on larger datasets. I guess I can forget about regular expressions now. Nice work!
I'd be keen to use something like this as an offline library. Is there anything similar that exists out there for auto-detecting data formatting and structure, but as a library instead?
I tried a list of unicode characters and their codepoint numbers, but it doesn't seem to recognize the unicode characters properly. Perhaps a normalization issue?
Really great. Manipulating lists is the hardest thing in the world for ordinary people, while for any programmer it is really easy. This should even things out a little.
We don't interpret the data, we only change the format. Since "November" doesn't literally match anything on the first line, it is repeated for each one of them.
This is expected behavior. I'm assuming you gave "a 1" as an example? Since your first line didn't contain the number 1, it is repeated over every line.
Is anyone else bothered by the fact that 5 years ago, this would have been a free command line tool? But nowadays it's a closed-source web app instead?
The audience who are likely to make the most use out of a tool like this are not the same as the audience who would be comfortable using a command line tool.
I mean, you can replicate the core functionality of this fairly easily using awk, and if you're happy doing a bit of piping to perl or whatever, the fancier time re-formatting stuff is also easy.
In essence, the complexity in this tool (and what makes it cool) is figuring out what you are trying to do without you telling it; if you can run a command line tool, you can tokenise the input yourself and you're most of the way there already.