Show HN: Sketch – AI code-writing assistant that understands data content (github.com/approximatelabs)
252 points by bluecoconut on Jan 16, 2023 | 49 comments
Hey HN!

I’m excited to share sketch: a tool to help anyone who uses python and pandas quickly iterate and get to answers for their data questions.

Sketch installs as a pandas extension that offers utility functions that operate on natural language prompts. Using the `ask` interface you can get answers in natural language. Using the `howto` interface you can get Python and pandas code directly. The primary benefit of this over Copilot and ChatGPT is that it adds data-content-based context, so the generated answers are much more accurate and relevant to the data problem at hand.

Check out the demo video[1] and try it out using the colab notebook (on github)!

[1] https://user-images.githubusercontent.com/916073/212602281-4...



Very cool demo!

Regarding the choice of name, presumably you already know about Sketch, the popular image editing software.

I wonder if the image editing guys will in the future incorporate AI functionality too? Which might make "Googling" for your product difficult for your potential customers?


There's also a program synthesis project called Sketch, which is much closer to the domain of what the user posted: https://people.csail.mit.edu/asolar/


That seems to be a presumption...

I just googled "sketch application" and "sketch image editing software" and found Sketchpad... Alternatively, I found sketch.com, which seems to be design software?

I mean, it's a vague enough name that I don't think it's a good name anyways, but I'm not sure it should be obvious that this is a taken name in the way that "Photoshop" would be.


This presents another issue with other sites outranking you.


This is fantastic and exactly where our team at Shipyard is expecting the data space to go. Context aware, AI driven. Great work on this!

We were just talking last week about how we should create a feature to describe transformations you want in Natural Language that get compiled to pandas/SQL. Input data is everything associated with the original file/dataframe.

Visual transformation tools are typically limited and non-reproducible. If you could switch it around to be code-compiled but description-driven, that would open up new possibilities.

I'd love to chat if you're open to it. Email in bio.


I'd love something like a standalone SQL IDE where I can ask an AI to generate queries or migration scripts.

Sadly, to be honest, I don't think I'd pay a subscription for such a service. I would prefer to pay a one-time tooling fee and just run a trained model in the IDE locally.


I built something similar for my own use. Using natural language, it makes SQL queries against your .csv and .xlsx files (soon I'll add features so you can connect to databases), but it is not mature enough to sell as a service. Feel free to reach me at info [at] rafaelmelhem . com if you want and I'll send a demo :)


Yeah the risk of your sql walking off to an AI vendor is not worth the time savings.


This is a great demo, OP.

I'm wondering about the UX of this vs Copilot. Is this basically just a way to get around the fact that you don't have Copilot inside of notebooks? What else am I missing about this experience?


Thanks!

That is definitely a big part of it, getting to use copilot style answers without having to install any plugins to the IDE (so getting to use this in colab or jupyter notebooks directly feels great).

That said, I use both copilot and sketch in my VScode notebooks, and find that they have slightly different feelings to the iteration loop.

Sketch offers a more "local" data context (pinning the text/prompt to the specific dataframe) which increases the quality of the suggestions (since more relevant information is within the token limit).


Copilot does work in notebooks, at least inside PyCharm and VS Code. I ask it for pandas solutions all the time.


Great work, and a really interesting application of GPT3. Some time ago I developed Datasloth [1] which might be a nice complementary feature to Sketch. Ping me if you're interested to bounce ideas :)

[1] https://github.com/ibestvina/datasloth


Well, I'm locked out of my GitHub account right now and don't feel like jumping through all those hoops, but I wanted to point out something minor.

In this line, https://github.com/approximatelabs/sketch/blob/9d567ec161015...

I think you can end up marking control characters as "UNKNOWN" characters by accident, because the code assumes that dictionary.items() returns items in a consistent order in all contexts/environments. This isn't always true.

edit: actually with the way the code is written if you have any overlapping ranges at all you'll end up double/triple/etc. counting a character into multiple categories.
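
To illustrate the overlap problem described above, here is a minimal sketch. It is not the library's code: the category names, the ranges, and both function names are hypothetical and chosen only to show how a catch-all range double counts characters, and how an explicit priority order avoids depending on dictionary iteration order.

```python
from collections import Counter

# Hypothetical category ranges (inclusive code points), NOT taken from sketch.
# The catch-all "UNKNOWN" range deliberately overlaps the other two.
CATEGORIES = {
    "CONTROL": [(0, 31), (127, 127)],
    "DIGIT": [(48, 57)],
    "UNKNOWN": [(0, 0x10FFFF)],
}


def in_category(ch, name):
    """Return True if the character's code point falls in any range of the category."""
    return any(lo <= ord(ch) <= hi for lo, hi in CATEGORIES[name])


def count_all_matches(text):
    """Count a character once per matching category, so overlaps are double counted."""
    counts = Counter()
    for ch in text:
        for name in CATEGORIES:
            if in_category(ch, name):
                counts[name] += 1
    return counts


def count_first_match(text, priority=("CONTROL", "DIGIT", "UNKNOWN")):
    """Count each character exactly once, using an explicit priority order."""
    counts = Counter()
    for ch in text:
        for name in priority:
            if in_category(ch, name):
                counts[name] += 1
                break  # stop at the first matching category
    return counts


sample = "\t42"
overlapping = count_all_matches(sample)
exclusive = count_first_match(sample)
```

With the explicit `priority` tuple, the result no longer depends on how the dictionary happens to be iterated, and the per-category counts add up to the length of the input.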


Does using this mean sending all of your potentially private data via an API call to OpenAI?


From https://github.com/approximatelabs/sketch/blob/main/sketch/p... it appears that this library is calling a remote API, which obviates the utility of the demonstrated use case.

Upon closer inspection, it looks like https://github.com/approximatelabs/sketch interfaces with the model via https://github.com/approximatelabs/lambdaprompt, which is made by the same organization. This suggests to me that the former may be a toy demonstration of the latter.

Interesting how as of the time of writing this, most of the comments here (i.e. dozens) are praising this as a legitimate use case. Maybe I'm missing something obvious, but it seems clear to me that uploading data to a third party to verify whether that data contains PII is a non-starter for any serious application.


"Does this data contain PII?"

"Yes, and you just shared it all with Microsoft :D"


Right now, it sends the first five rows of the dataframe: https://github.com/approximatelabs/sketch/blob/9d567ec161015...


Looks really nice, but I tried it:

  import sketch
  import pandas as pd

  data_pd = pd.read_csv("input.csv", sep=';')
  print(data_pd)
  print(data_pd.sketch.ask("Is there any PII in this dataset ?"))
  print(data_pd.sketch.ask("Which columns are integer type?"))

With this input.csv:

  name;age;address;phone
  Bob;34;106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD;1-541-754-3010
  Anna;34;694 Short Street, Austin, Texas;001-541-754-3010

And I have no results (and no runtime error as well) :-( Here is the console output:

     name  age                                         address             phone
  0   Bob   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
  1  Anna   34                 694 Short Street, Austin, Texas  001-541-754-3010
  <IPython.core.display.HTML object>
  None
  <IPython.core.display.HTML object>
  None
Am I missing something? The "ask" interface doesn't seem to need external OpenAI credentials, right?


To get the strings of the results back out, add the kwarg `call_display=False` to the functions.

So:

  print(data_pd.sketch.ask("Is there any PII in this dataset ?", call_display=False))

should work for you.

Right now it assumes by default that it's in an IPython context that can display HTML objects.


Ah yes it displayed the string, thanks!

But the result looks wrong with this input:

     age                                         address
  0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD
  1   34                 694 Short Street, Austin, Texas
It says:

  No, there is no PII (personally identifiable information) in this dataset. The only columns are index, age, and address, none of which contain any sensitive information.

Sometimes, it seems to work with phone number though. Here:

     age                                         address             phone
  0   34  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
  1   34                 694 Short Street, Austin, Texas  001-541-754-3010

  Yes, this dataset contains PII (personally identifiable information) such as age, address, and phone number.
I retried:

     pirce                                         address             phone
  0    123  106 DOYERS ST. 8 ARLINGTON DR. 599 NW BAY BLVD    1-541-754-3010
  1  43543                 694 Short Street, Austin, Texas  001-541-754-3010

  No, there is no personally identifiable information (PII) in this dataset. The columns contain only generic information such as index, price, address, and phone number. None of these columns contain any information that could be used to identify an individual.
Which is wrong. Is there an explanation?


Just played around with this and I think I'll be using it on some research projects!

One cool feature would be some sort of chaining, where you could anchor a new query to a previous one.

For example, on the sales data demo, I started with the howto query "Plot the sales per month in a bar chart using plotly."

However, I got a bug since "Order Date" wasn't a datetime, so I added "Make sure to make 'Order Date' a date column." The new code worked, but gave months as integers 1-12.

When I added "Include month name on x-axis (e.g., Jan, Feb, ...).", the model sort of gave up and spit out some buggy code that didn't make a bar plot.

In this example, it would be great to be able to chain the howto commands, so the previous result is used as context for the new one.
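
For reference, the three steps those prompts ask for (parse the date column, aggregate per month, label months by name) can be written by hand roughly as below. This is a sketch under assumptions: the sample rows, the date format, and the "Sales" column are made up for illustration; only "Order Date" comes from the comment above.

```python
import pandas as pd

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Made-up sample data; the real demo dataset may use different columns and formats.
df = pd.DataFrame({
    "Order Date": ["01/15/19 10:00", "01/20/19 12:30", "02/03/19 09:15"],
    "Sales": [100.0, 50.0, 75.0],
})

# Step 1: make 'Order Date' a real datetime column (it starts out as strings).
df["Order Date"] = pd.to_datetime(df["Order Date"], format="%m/%d/%y %H:%M")

# Step 2: total sales per calendar month (index is the month number 1-12).
monthly = df.groupby(df["Order Date"].dt.month)["Sales"].sum()

# Step 3: replace month numbers with names for the x-axis labels.
monthly.index = [MONTH_NAMES[m - 1] for m in monthly.index]

# Plotting is left commented out so the snippet runs without plotly installed:
# import plotly.express as px
# px.bar(x=monthly.index, y=monthly.values).show()
```

Each step corresponds to one of the follow-up prompts, which is roughly the context a chained `howto` call would need to carry forward.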


I spent a few weeks last year building a text-to-SQL tool using the Codex model to do something like this, but for all kinds of data sources. We pivoted away to something else for various reasons.

But your approach is much better. Pandas is used a lot. Build a tool on top of pandas. This is awesome.


https://hal9.com is focused on building data apps with LLMs, would love to explore integrating and contributing to Sketch. If this sounds interesting I’m at javier at hal9.ai


Very promising. I believe the uses of OpenAI that will stick in the long term are like this, and other tools should be experimenting with this kind of integration.

Otherwise, there's room for other solutions, such as AirOps Sidekick [1], which uses browser extensions to embed itself in other data tools.

[1] https://www.airops.com/


Damn, this looks pretty useful. I was finding that github copilot was really good at reading a CSV file and writing all the imports from that into migrations for DB import, but this looks like it does these data transformations even more robustly.

Are there any plans to get this working outside of the Python/pandas ecosystem, or is it intrinsically tied to that environment?


I use TabNine [0] for local context-aware AI suggestions, and I find it spookily good at guessing what I'm halfway through typing. Sadly they've left the Sublime plugin to rot and it's mostly a hindrance in ST4.

[0] https://www.tabnine.com


Hi, cool stuff! Which LLM is being used in the background? I may have missed that info in the readme. Thanks!


Thanks!

Right now this is running off of GPT-3 (`text-davinci-003`) and via a small code change can run on codex (`code-davinci-002`) but the quality only improves a little bit with that change.

That said, this is the first version to show that the interface is viable; we are currently working on training our own foundation model on a hybrid tokenization of data and word tokens. I hope to improve this same toolkit in the future with these new models of our own that we are training.



This is very cool. A useful case for gpt. One question / concern: isn't a person's address considered PII? Is the system flexible enough to add pre-statements such as "treat an address as PII"?


Related question: is this done on my machine or do I end up sending possible pii to a cloud service for evaluation?


This is sending summary statistics to a cloud machine by default (for ease of immediate use): https://github.com/approximatelabs/sketch#sketch-currently-u...

You can run using your own OpenAI key by setting 2 environment variables: (1) SKETCH_USE_REMOTE_LAMBDAPROMPT=False (2) OPENAI_API_KEY=YOUR_API_KEY
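
From inside a notebook, the same two variables can be set in Python before the library is imported. This is a minimal sketch: the variable names are the ones listed above, the key value is a placeholder, and setting them before the import is a precaution, since whether sketch reads them at import time or at call time is an assumption here.

```python
import os

# Variable names as given above; set them before importing sketch so they are
# visible no matter when the library reads them (an assumption, not verified).
os.environ["SKETCH_USE_REMOTE_LAMBDAPROMPT"] = "False"
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"  # placeholder, substitute your own key

# import sketch  # uncomment once the package is installed
```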

To run entirely locally (using your own GPU and a model like Bloom) one would have to add a new prompt type to `lambdaprompt` (the package that this depends on), have a machine with enough GPU resources, and then add a slight modification to sketch.


Not sure if this is a business you're building out of this or an experiment. For real use for any of my customers, I would need to run this entirely locally.

I think it's really awesome though!

Curious what "enough GPU resources" looks like? Would a GeForce RTX 40 or 30 series card with 12-24GB of RAM be sufficient per user running locally on their machine?


If this is using OpenAI, which it seems to be, it is only sending column headers / column names, not the data. If you are concerned about column names, you could also mask them on the way out and back in. If you are looking for an end-to-end database connect and query, please reach out to me.
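
A rough sketch of the masking idea: rename columns to neutral placeholders before anything leaves the machine, then map the placeholders back in whatever text or code comes back. Both helper names are hypothetical and nothing here is part of sketch. Note that this only hides column names; if row values are sent as well (another comment in this thread says the first five rows are), they would need separate handling.

```python
import pandas as pd


def mask_columns(df):
    """Return a copy with placeholder column names, plus the original-to-placeholder mapping."""
    mapping = {col: f"col_{i}" for i, col in enumerate(df.columns)}
    return df.rename(columns=mapping), mapping


def unmask_text(text, mapping):
    """Replace placeholders in returned text/code with the original column names."""
    # Longest placeholders first, so "col_1" does not clobber "col_10".
    for original, placeholder in sorted(mapping.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(placeholder, original)
    return text


df = pd.DataFrame({"name": ["Bob", "Anna"], "phone": ["1-541-754-3010", "001-541-754-3010"]})
masked_df, mapping = mask_columns(df)

# Pretend this string came back from the model, written against the masked names.
returned_code = "masked_df['col_1'].nunique()"
restored_code = unmask_text(returned_code, mapping)
```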


Cool project, although the name kinda clashes with the well-known https://www.sketch.com/ in the UI/UX design space


This is very cool! I've literally today been noodling with ideas to use probabilistic data structures in LLMs.

And TIL you can embed mp4s in a GitHub readme. Is that new?


I don't have any experience with pandas. Can this directly connect to a DB and run queries there? (The video seems to load a CSV file.)


If you can already write SQL to return a data set then you can get that set to pandas with pyodbc.
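
As a minimal sketch of that flow, the example below uses the standard-library `sqlite3` module as a stand-in so it runs without a database server; the table and column names are made up. With pyodbc, the connection would come from `pyodbc.connect(...)` with your own connection string, and the `pd.read_sql_query` call would be the same in spirit (pandas may warn about DBAPI connections other than sqlite3).

```python
import sqlite3

import pandas as pd

# In-memory SQLite database standing in for a real server reached via pyodbc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("a", 1.5), ("b", 2.0), ("a", 3.0)],
)

# Any SQL that returns a result set can be loaded straight into a DataFrame.
query = """
    SELECT product, SUM(amount) AS total
    FROM orders
    GROUP BY product
    ORDER BY product
"""
df = pd.read_sql_query(query, conn)
conn.close()
```

Once the result is a DataFrame, the pandas extension described in the post can be used on it like on any other dataframe.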


So... Microsoft bought 48 or 49% of OpenAI right? Integrating this into Excel would make everyone an excel power user.


A lot of people already use excelformulabot. The impact of something integrated into Excel would be pretty big.


It’s already integrated into Excel with the add-on.

What else did you have in mind?


But if it makes a logical mistake, it would take a real power user to notice it.


There is a good chance a real power user wouldn't notice it either.

It is like the race is on to make a really dumb, money losing business decision based on a ridiculous ChatGPT error.

It is like wanting to take the first iteration of the DARPA challenge self driving cars out on the freeway for a test drive. Good luck with that.


But wouldn't you need to integrate Python into Excel for this to work?


Really cool and helpful. Is there anything similar for R?


The GPT-3 model generates SQL. You can run sqldf on top of your data.table. We will be demoing at one of the events shortly. BTW, you could do something similar with other LLMs such as GPT-J and GPT-NeoX if you have worked with them.


Is GPT-J/NeoX good enough to generate code? I tried it with SQL and it was really disappointing.


They are decently good; I could not find major differences for the cases I was trying. The key is to control the temperature. Make sure it is low, otherwise the randomness increases tremendously. In fact, you can feed the same input from OpenAI into NeoX and it generates results. There are many NeoX open playgrounds that allow you to control the temperature etc.
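
To see why a low temperature reduces randomness, here is a small standalone illustration of temperature-scaled softmax, the usual way sampling temperature is applied to a model's output scores. The logits are made-up numbers and the function name is hypothetical; no real model is involved.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up scores for three candidate next tokens.
logits = [2.0, 1.0, 0.5]

low = softmax_with_temperature(logits, 0.2)   # nearly always picks the top token
high = softmax_with_temperature(logits, 2.0)  # probability spread across tokens
```

At a low temperature almost all probability mass sits on the top-scoring token, which is usually what you want for SQL or code generation; at a high temperature the alternatives get sampled much more often.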


Cool! Will check it out thanks.


[deleted]



