Show HN: If VS Code had a data-centric IDE sibling, what would that look like? (github.com/code-kern-ai)
122 points by jonathan_re on July 18, 2022 | 57 comments



JetBrains has one specifically for this: DataSpell - https://www.jetbrains.com/dataspell/.

It's like a tweaked version of DataGrip + PyCharm, but catering specifically to the particular needs of data scientists.


I just got this to make working with Zeppelin worksheets more bearable. It's such an improvement over the web interface, it's hard to describe. At least for someone already used to their software. As a bonus, I also get the best Cassandra GUI I've seen so far.


Microsoft has a fork of VS Code called Azure Data Studio. It's made for DB queries and notebooks: https://github.com/Microsoft/azuredatastudio


> Microsoft has a fork of VS Code called Azure Data Studio

It's pretty buggy, and they haven't made much progress on fixing the bugs. I've used it pretty much since it came out - I do Microsoft SQL Server & Postgres work on my Mac - and the thousands of open issues on a relatively new product say something:

https://github.com/microsoft/azuredatastudio/issues


I've also had queries (usually really large ones) run without issue in SSMS but crash Azure Data Studio. The only advantages that Azure Data Studio really has over SSMS are that it's a lot snappier (unless you're working with really large queries) and it has a dark mode. Other than that, SSMS still seems better in just about every way.


SSMS is also a multi-GB install full of stuff that isn't useful to anyone who isn't a full-time DBA, whereas Azure Data Studio is barely more than the usual Electron app install footprint and leaves the less-needed functions to plugins/extensions. ADS also supports cross-platform work (the above commenter mentions working on macOS), whereas SSMS is not cross-platform at all. Azure Data Studio has plenty of advantages over SSMS.

As a developer, I understand my need to have all 20-120+ GBs of Visual Studio installed, but between SSDT in Visual Studio and Azure Data Studio, I am happy to avoid spending all that hard drive space on SSMS for features I don't need as a developer.


I've been using Azure Data Studio for a few tasks over the last year or so, and it's worked fine. Just another data point. I'm running it under Linux, because SSMS is not available for Linux.


>and the thousands of open issues on a relatively new product say something:

That's generally a good thing. If a product had no issues that would mean that there's no uptick in usage.


Follow-up: Is there a VSCodium equivalent for that (reproducible build, no tracking)?


I think there isn't, and if there were, it would be illegal:

> You may not sublicense the Software Code or any use of it

https://github.com/microsoft/azuredatastudio/blob/main/LICEN...


To publish a version of Azure Data Studio with open-source binaries would fall under copy/modification rules, which are explicitly allowed in the license:

> Microsoft Corporation ("Microsoft") grants you a nonexclusive, perpetual, royalty-free right to use, copy, and modify the software code provided by us ("Software Code").

It would not be a sublicense.


Hey, I'm Johannes - one of the maintainers of refinery. Thanks Jonathan for sharing!!

Would be super excited if you guys have any feedback. It's nowhere near perfect yet, but you can already use it to build some great data-centric use cases. Amongst others for sentiment analysis, conversational AI or finetuning of your embeddings (which you can check out here: https://github.com/code-kern-ai/refinery-sample-projects).

Let me know what you think :)


The title should probably reflect that this is specifically for managing NLP labeling tasks. It looks like a great project! Years ago I bought a book on data prep and labeling for NLP, and based on that book this project looks like it covers the main workflows you would need.


Hey Mark, I totally agree. We're focusing on NLP, but we're generally interested in what programming will develop into.

To exaggerate a bit (but I like the idea): with "regular programming" (not the best term, but I mean rule-based systems etc.), you used to develop via punch cards. You had to think multiple times before "compiling" something, right? I believe that we're currently in that phase regarding supervised learning development. If you have a labeling project, you need to plan this long in advance, ...

We're in love with VS Code, but we're missing something like this for AI. Our application tries to show how developers can program their training data, i.e. refine raw data into training data, and do so with many programmatic approaches. We're trying to show what something like this could look like (hence the title), and do so in NLP.

But again - I agree with you :)


In case you're looking for a VS Code extension to quickly preview, filter and plot data from various file formats you can check out vscode-data-preview [0]

[0]: https://marketplace.visualstudio.com/items?itemName=RandomFr...


Awesome, thanks for the suggestion. Already installed :)


This actually seems like a major leap forward in a really underloved space. Congratulations on your release.

If anyone is remotely interested in data-labelling/exploration, I would definitely recommend checking this out, it has some really exciting features, for example, built-in zero shot classification for heuristics/baselines: https://docs.kern.ai/docs/building-zero-shot-classifiers

I'm also really impressed with the architecture! Very neat.

Not affiliated with the project, just very pleased to see something like this as an open source release.


I really think so too. This space is driven by non-open-source labeling tools without any possibility for customization. I really appreciate seeing something like this as an open-source project. It will definitely move this space in the right direction.


team member of kern.ai here

> I’m also really impressed with the architecture! Very neat.

Thank you very much for your feedback, really appreciate it :) We work hard on the architecture of the product and are constantly wrapping our heads around how to make things faster, more stable, and more scalable, which is a lot of fun for such a data-intensive application. We are open to constructive feedback on our tool, so please feel welcome to join our community on Discord: https://discord.gg/JzA3zDH2


Thanks, means the world!


It would look a lot like JetBrains DataGrip.


I'm fond of VisiData; not quite an IDE, but a good start.

https://www.visidata.org/


Looks interesting, I'll check it out!


As I work mostly on MSSQL Server, the sweet spot for me is SSMS with RedGate SQL Prompt.

For non-MSSQL things, it's usually DBeaver.

No need for a new IDE for me.


DBeaver is my go-to for anything that it supports (PostgreSQL, mainly). It's glorious.


Kind of hard to imagine a VS Code sibling for the whole data-centric ecosystem. Maybe something like a base platform with multiple extension points for different tasks and the ability for others to extend the platform? (So, like extensions in VS Code.)


I think so too. Mostly that it is something open. I also believe that it will change the workflow a bit, and that DVC will play a major role in it for versioning your different data hypotheses. Let's see, exciting times ahead!


It would look like Datagrip without the ridiculous yearly subscriptions.


It's likely because I buy an 'individual' license, so my costs are lower, but I'm not sure what's all that 'ridiculous' about the subscription. If you pay for a year, you get that version forever, just without updates to newer versions. VERY similar to ... back in the 90s... going to a store and buying a CD with software on it, and using it 'forever'. Eventually, it didn't work with newer stuff, so you upgraded to a new version (with more money).


That is a really fantastic model. It makes so much sense in retrospect. Kudos to all at JetBrains!


Their subscriptions couldn't possibly be fairer. An annual individual license costs what, two hours of wages? And you get a perpetual fallback, and a 40% continuity discount. It's incredibly reasonable.


Even more "incredibly reasonable" behaviour: giving 30 day access to EAPs which always seem to be updated every 30 days anyway means that almost anyone could, with a bit of work, use their products for 'free' long term.

I know the price is going up in Oct, and I will likely 'buy ahead' 2-3 years at the current price now to save a bit extra.


Data-centric IDE screams like Excel!


I’m curious: what would need to be added to Excel to do this?


My guess (some of this we already have, some we don't):

- automation: integration of heuristics (multiple columns that you can program via formulas and such)
- exploration: finding outliers or the most similar records given some reference (e.g. "I want to label more rows that are about business news to some extent")
- monitoring
- label management [which we don't yet offer to the extent we'd like]: merging and splitting labels etc.

Generally, anything that scales and "somewhat" guarantees that users input valid labels.

But it definitely offers something that new tools don't: users are super familiar with it.
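The "exploration" point above can be illustrated with a minimal sketch: rank records by their similarity to a reference text. This is a toy under stated assumptions (plain bag-of-words counts instead of real embeddings, and hypothetical function names); it is not how refinery or Excel implements anything.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase bag-of-words counts; a stand-in for real embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def most_similar(reference, records, top_k=2):
    """Return the top_k records most similar to the reference text."""
    ref = tokenize(reference)
    ranked = sorted(records, key=lambda r: cosine(ref, tokenize(r)), reverse=True)
    return ranked[:top_k]


records = [
    "Stock markets rally as business confidence grows",
    "Local team wins the championship final",
    "Business news: tech company reports record earnings",
]
# The earnings headline shares the most tokens with the reference, so it ranks first.
print(most_similar("business news about company earnings", records))
```

With real data you would likely swap the token counts for sentence embeddings, but the ranking logic would stay the same.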


Do NLP users use Excel naturally already?


Annotation platforms use Excel. I once received 25 separate Excel spreadsheets from a labeling service for 10k texts (short texts about product titles, e.g. "Sauvignon blanc" -> "wine"). Had to merge them, which wasn't as easy as you might expect.

Also, I once labeled 5,000 texts during my master's degree via Excel. Was painful as hell.
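A minimal sketch of what such a merge could look like in Python. It assumes each spreadsheet was first exported to CSV with `text` and `label` columns (both column names are hypothetical), and that label variants differ only in case and whitespace; anything fuzzier would need more work.

```python
import csv
import io
from collections import defaultdict


def normalize_label(label):
    """Collapse case and whitespace so 'Wine ' and 'wine' compare equal."""
    return " ".join(label.strip().lower().split())


def merge_label_files(files):
    """Merge several CSV exports into one {text: label} mapping.

    `files` is an iterable of open text-mode file objects, each with
    hypothetical 'text' and 'label' columns. Texts whose normalized labels
    disagree across files are returned separately for manual review.
    """
    seen = defaultdict(set)
    for handle in files:
        for row in csv.DictReader(handle):
            seen[row["text"].strip()].add(normalize_label(row["label"]))

    merged = {t: next(iter(ls)) for t, ls in seen.items() if len(ls) == 1}
    conflicts = {t: sorted(ls) for t, ls in seen.items() if len(ls) > 1}
    return merged, conflicts


# Two tiny in-memory "files" standing in for the real exports.
part_a = io.StringIO("text,label\nSauvignon blanc,wine\nPilsner,Beer\n")
part_b = io.StringIO("text,label\nSauvignon blanc,Wine \nPilsner,lager\n")

merged, conflicts = merge_label_files([part_a, part_b])
print(merged)     # {'Sauvignon blanc': 'wine'}
print(conflicts)  # {'Pilsner': ['beer', 'lager']}
```

Keeping the conflicts separate instead of silently picking a winner makes it obvious which rows still need a human decision.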


Thanks for the insight about annotation platforms!

How might Excel have been better for you in these tasks? Or put another way, in the first case of merging 25 files - did you wind up using a different tool to merge and then re-opening them in Excel? Was Excel limiting because you needed to do some kind of fuzzy matching against the labels, e.g., wine, Wine, white wine etc. to do the merge?

On the labeling task that you had - what might have made that easier for you? Some kind of custom scripting that's above and beyond what you can do in VBA?


Hey, you mention that it is open-source but I cannot actually see the source code in that repo.

EDIT: I can see some sources in other repos of the same org, for instance: https://github.com/code-kern-ai/refinery-ui so it's just a matter of making it easy for devs to navigate the code.


Hi Ruben,

you can take a look at our architecture overview here: https://github.com/code-kern-ai/refinery#-architecture

A bit below it, you find a table with the links to all repositories. All of them are open-source. But thanks for the feedback, I'll try to make it a bit easier to understand! I appreciate that! :)



Looks nice! Which data types does the software support?


Hi Tom! Thanks, happy to hear that :)

We've focused on JSON as the user-specified data model. So you can upload anything fitting into a JSON. We're using pandas to process the uploaded data, so spreadsheets or CSV-ish also work.

We've got a public roadmap (https://github.com/code-kern-ai/refinery/projects/1), and we're looking forward to also integrating e.g. native PDF labeling sometime soon.
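To illustrate the "anything fitting into a JSON" point, here is a minimal sketch of turning CSV rows into JSON-style records. It uses only the standard library rather than pandas, the column names are made up, and it is not refinery's actual import code.

```python
import csv
import io
import json


def csv_to_records(handle):
    """Read CSV rows into a list of dicts, one JSON-style record per row."""
    return [dict(row) for row in csv.DictReader(handle)]


# In-memory stand-in for an uploaded CSV file (hypothetical columns).
upload = io.StringIO("headline,running_id\nMarkets rally,1\nTeam wins final,2\n")
records = csv_to_records(upload)
print(json.dumps(records, indent=2))
```

Note that the `csv` module reads every value as a string, so numeric columns would need an explicit conversion step.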


Excel


Maybe also PowerPoint?


Also Power BI


Looks awesome, quite fitting to my text labeling needs actually. Might try it out at some point for my next prototype


Amazing project! After testing it for a few minutes, I think this is a handsome tool. Smooth and functional.


Thanks Francis! Means a lot :)


That looks cool. Are there any similar tools? I haven't seen anything like that before.


The most famous is arguably Snorkel, which started with an open-source library as a research project. We used that a lot ourselves, but the library by now is deprecated.

We aim to extend on that idea by providing something that comes as close to a programmable interface for data-centric tasks as possible, and do so via open-source.
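For readers new to the idea, programmatic labeling can be shown with a minimal sketch: several small heuristic functions each vote on a label or abstain, and the votes are combined. The majority vote below is a deliberately simple stand-in; it is neither Snorkel's label model nor refinery's API, and all function names are hypothetical.

```python
from collections import Counter

ABSTAIN = None  # a heuristic returns this when it has no opinion


def lf_mentions_wine(text):
    return "wine" if "wine" in text.lower() else ABSTAIN


def lf_grape_variety(text):
    grapes = ("sauvignon", "merlot", "riesling")
    return "wine" if any(g in text.lower() for g in grapes) else ABSTAIN


def lf_beer_styles(text):
    styles = ("pilsner", "ipa", "stout")
    return "beer" if any(s in text.lower().split() for s in styles) else ABSTAIN


def majority_vote(text, labeling_functions):
    """Combine heuristic votes; return ABSTAIN on no votes or a tie."""
    votes = Counter()
    for lf in labeling_functions:
        label = lf(text)
        if label is not ABSTAIN:
            votes[label] += 1
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie: leave for manual labeling
    return ranked[0][0]


lfs = [lf_mentions_wine, lf_grape_variety, lf_beer_styles]
for title in ["Sauvignon blanc", "Craft IPA 6-pack", "Garden hose"]:
    print(title, "->", majority_vote(title, lfs))
```

Records where every heuristic abstains or the votes tie are exactly the ones worth routing to manual labeling.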

There are lots of cool tools out there in that area, btw. Definitely worth having a look at the landscape (haha idea for the next HN post incoming I guess :D)


Looks very promising! Are you planning to release it as an application?


Hey, thanks! :) What exactly do you mean by application? You can just pull the repository. As mentioned in the installation section, it is quite easy to start the app on your localhost.


Looks like a managed application is already offered: https://docs.kern.ai/docs/saas-application


As a fellow HPI graduate, I wish you all the best :)


Certainly looks interesting! Will give it a try.


Nice, thanks. If you have any questions, please don't hesitate to contact us. Here's our Discord: https://discord.com/invite/qf4rGCEphW



