Show HN: dstack – an open-source tool to build data applications easily
134 points by kaudinya 11 months ago | 29 comments
Dear HN,

I am Riwaj, the cofounder of dstack.ai (https://github.com/dstackai).

A few months ago, we built an online service that allows users to publish data visualizations from Python or R. The idea was to build a tool that requires no additional programming or front-end development to publish data visualizations. The publishing code can be invoked from a Jupyter notebook, RMarkdown, or a Python or R script. Once the data is pushed, it can be accessed via a browser.

Open-sourcing dstack: During our customer discovery phase, we realized that dstack.ai should integrate with many more open-source data science frameworks than we could integrate ourselves. For example, as a user, I want to push a matplotlib plot, a TensorFlow model, a plotly chart, or a pandas dataframe, and I expect the presentation layer to fully support it. Supporting all types of artifacts and providing all the tools to work with them on our own seems to be a very challenging task. So we open-sourced the framework. Now you can build dstack locally and run it on your own servers, or in a cloud of your choice if needed. More details on the project, how to use it, and the source code of the server can be found in the https://github.com/dstackai/dstack repo. The client packages for Python and R are available at https://github.com/dstackai/dstack-py and https://github.com/dstackai/dstack-r, respectively.

What’s next:

User callbacks- so that application shows not just pre-calculated visualizations but also can fetch data from a store and process it in real-time.

ML models- so that data scientists can publish a stack which binds together a pre-calculated ML model and user parameters.

Use cases- Support specific use cases that help data scientists to build data science models into data applications as fast as possible.

We would be happy to get your feedback on the open-source framework, and your opinion on what kinds of use cases can be built on top of it. Thank you.

I am trying to use dstack on my device and it still asked for login information, which prompted me to read the terms. Under "User Content" I noticed this:

> You hereby grant to Company an irreversible, nonexclusive, royalty-free and fully paid, worldwide license to reproduce, distribute, publicly display and perform, prepare derivative works of, incorporate into other works, and otherwise use and exploit your User Content, and to grant sublicenses of the foregoing rights, solely for the purposes of including your User Content in the Site. You hereby irreversibly waive any claims and assertions of moral rights or attribution with respect to your User Content.

Are terms like these common?

Yes, pretty common. They are intended to avoid a situation where you post your own copyrighted [1] content to their site, then sue them for displaying it without a license.

The long list makes it seem very broad, but this phrase constrains it quite a bit: "solely for the purposes of including your User Content in the Site." This would prevent them from using your content in an ad, or selling your content to some other company, for instance.

[1] Under U.S. federal law, all content is copyrighted upon creation. I hold the copyright on this comment, and I have granted Y Combinator a license to display it on the HN site.

EDIT - here is the relevant sentence from the HN terms of use agreement. It's actually broader than the language you quoted.

> By uploading any User Content you hereby grant and will grant Y Combinator and its affiliated companies a nonexclusive, worldwide, royalty free, fully paid up, transferable, sublicensable, perpetual, irrevocable license to copy, display, upload, perform, distribute, store, modify and otherwise use your User Content for any Y Combinator-related purpose in any form, medium or technology now known or later developed.

Does this apply only to content posted on dstack's data servers?

Yes, these Terms of Use cover only the use of dstack.ai (the hosted version) and don’t cover the open-source tool.

The open-source tool is fully covered by Apache 2.0.

The open-source library is licensed under Apache 2.0 and is not subject to any of the restrictions in the terms listed on dstack.ai. However, the terms on dstack.ai that you quote also sound odd to me. Truth be told, the current terms were generated from one of the common templates provided for startups. After we put them up, we didn't have a chance to review them again. Now that you've brought it up, it's certainly time to revisit them. We certainly don't want to claim rights over any user content. The probable exception is the website itself using published content to show it to users according to the owner's sharing settings. I'm going to review the terms and come back with an update.

The GitHub README is a bit confusing (for someone with no experience). The landing page [1] gives a much clearer outline of what this actually is.

Seems like it could be good for making data-driven dashboard graphs, although the React library [2] looks like it needs a bit more work.

Congrats on shipping something though!

[1] https://dstack.ai/

[2] https://github.com/dstackai/dstack-react

Thank you very much. You're right, the README file needs improvement. We also don't have many tutorials yet that show the tool from a practical point of view. Our short-term plans include: 1) improving the documentation and writing more use-case-specific tutorials; 2) adding more functionality for more interactive applications, including machine learning applications.

EDIT: Speaking of the react library, we've just finished a refactoring and plan to improve it too. Please don't hesitate to share your feedback, over email or via GitHub issues. And thank you!

UPDATE: https://github.com/dstackai/dstack-react is outdated and was fully merged some time ago into the main repository https://github.com/dstackai/dstack. So we will delete https://github.com/dstackai/dstack-react within a week.

I looked through GitHub and see you have code examples. Can you also include the visual output, so I can get a sense of what it'll achieve without having to set it up?


1. Here's the simplest tutorial on how to make an interactive dashboard and share it: https://docs.dstack.ai/tutorials/dashboards-tutorial It includes screenshots.

2. Here's another tutorial with more realistic data. Output: https://dstack.ai/gallery/d/b56128a3-522e-42d7-8662-9b1a768d... The code for it is available at https://github.com/dstackai/dstack-tutorials-py/blob/master/...

Actually, we have very few examples. We're going to make more of them this week.

Thanks for sharing!

Can you elaborate on “What’s next”?

> User callbacks- so that application shows not just pre-calculated visualizations but also can fetch data from a store and process it in real-time.

How are you envisioning this working? Will dstack be like a database? How will “user callbacks” be triggered?

Hi, thanks for the question. This feature is still in the design stage. The idea is pretty simple. Currently, you can push a pre-calculated visualization and associate it with a particular user input. However, in many cases it's not possible to precalculate all possible combinations of user input in advance. That's why we'd like to let the user push not a visualization but a function that produces a visualization. This function will be triggered when the user changes the input. Such a function can build the visualization on the fly and, if needed, take the data from an external source.
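
To make the difference concrete, here is a minimal sketch in plain Python of pushing pre-calculated outputs versus pushing a function. It is not dstack's API (the feature is still being designed): `PrecalculatedStack`, `CallbackStack`, `render`, and `sales_chart` are hypothetical names used only for illustration, and a plain dict stands in for a real figure.

```python
from typing import Any, Callable, Dict, Tuple

Params = Tuple[Tuple[str, Any], ...]


def _key(params: Dict[str, Any]) -> Params:
    """Turn a dict of user input into a hashable, order-independent key."""
    return tuple(sorted(params.items()))


class PrecalculatedStack:
    """Hypothetical: stores one ready-made output per combination of inputs."""

    def __init__(self) -> None:
        self._outputs: Dict[Params, Any] = {}

    def push(self, output: Any, params: Dict[str, Any]) -> None:
        self._outputs[_key(params)] = output

    def render(self, params: Dict[str, Any]) -> Any:
        # Raises KeyError for any combination that was not pushed in advance.
        return self._outputs[_key(params)]


class CallbackStack:
    """Hypothetical: stores a function and calls it whenever the input changes."""

    def __init__(self, callback: Callable[..., Any]) -> None:
        self._callback = callback

    def render(self, params: Dict[str, Any]) -> Any:
        return self._callback(**params)


def sales_chart(region: str, year: int) -> Dict[str, Any]:
    # Stand-in for real work: fetch rows from a store and build a figure.
    return {"title": f"Sales in {region}, {year}", "points": [year, year + 1]}


# Pre-calculated: only the combinations pushed ahead of time can be shown.
static = PrecalculatedStack()
static.push(sales_chart("EU", 2020), {"region": "EU", "year": 2020})

# Callback: any combination can be computed when the user asks for it.
dynamic = CallbackStack(sales_chart)
print(dynamic.render({"region": "US", "year": 2021})["title"])
```

The trade-off in this sketch is that the callback version needs somewhere to run the function at view time, while the pre-calculated version only needs storage.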

Looks great, how do you compare to other alternatives in the same space?

Hi, I'm Peter, part of the team. There are already quite a few solutions that try to make it easy to build data applications with Python or R. The most relevant ones include Plotly Dash, Shiny, Voila, and Streamlit. All of them are great projects, even though they are all very different. Our project is an attempt to explore this area and figure out how to build these applications without needing to do programming, CSS, HTML, or deployment.

Basically, we want to make building data apps as simple as writing a few lines of code using only the libraries that data scientists already know: pandas, Matplotlib, scikit-learn, TensorFlow, PyTorch, etc. Ideally, you don't have to write your application code at all, and instead deploy your data science models and bind them with simple UI logic. We believe the need to apply ML to enterprise use cases will grow even more and tools like this will be very useful. Basically, you'll be able to create an application that helps your HR/Sales/Marketing/Product/<you name it> department apply ML in minutes, without the need to write, deploy, or maintain that application.

"No-code FE for ML"? Sounds awesome to me.

What alternatives come to your mind? I'm looking for a solution like that and would love to get some insights.

streamlit.io is one that I've just finished trying, and I find it easy to use.


Plotly Dash


Does your repo need some license information added? Something that protects you from someone taking this and running their own paid hosted version :) I take it it isn't MIT!

We open-sourced it under Apache 2.0 which we find quite permissive and OS-community friendly. You're welcome to run a hosted version. We actually have it running on dstack.ai (it's free currently but we of course plan to have paid features if there is such a need).

Your website theme seems to be similar to many others lately re: the cartoon figures etc. Is this a free/paid/opensource template?

The one on the website, if I recall correctly, was mostly designed in-house. I'm going to double-check that with the designer. The blog post image was taken from https://icons8.com/illustrations (free). There are actually a lot of free illustration libraries nowadays. One of my favorites is https://undraw.co/


How does this compare to Dash by Plotly, or R Shiny, in terms of the intended use case?

It looks like you are more of a holistic platform, including a workflow scheduler etc.

One thing is certain: we would like our tool to be agnostic to data science tools and work with all of them. So you can use pretty much any visualization or ML library.

Another thing is that we’d like to eliminate the need to do any programming or HTML/CSS as much as possible.

The jobs that are available as part of the hosted solution are not yet part of the open-source library, but moving them under open source too is certainly something for us to consider.

We are currently at quite an early stage and a lot of work is still ahead. We’ll appreciate any feedback and suggestions on where to steer the roadmap.

We're going to work on preparing more use-case-specific tutorials within the coming weeks.

How do you deal with differential versioning of code and data, and the fact that people don't always execute notebooks from top to bottom?

For instance, suppose I have a notebook that takes 2 hours to generate a model. From the viewpoint of explaining it I'd like to make a notebook where I start from the beginning, train the model, then use it.

If I want to show it to people I want to save all the results and re-render them, not rerun the calculation, certainly if I want to show off the results in a 1 hour talk!

From the viewpoint of reproducibility, however, you have to be able to run the notebook from top to bottom and get a 'correct' result. I'm not going to say the 'same' result because many calculations are stochastic in nature (e.g. random numbers) or because often the data changes. (Let's say I have somebody make a notebook that does April's sales reports -- shouldn't I just be able to point it to the May data to make May's sales reports?)
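
One way to reconcile "save and re-render" with "rerun from top to bottom" is to cache each expensive step by its parameters, so a full rerun is cheap unless the inputs change. The sketch below uses only the Python standard library; `cached_run` and `train_model` are hypothetical names, the month strings are placeholders, and the training step is a stand-in for the 2-hour job.

```python
import hashlib
import json
import pickle
import random
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict


def cached_run(step: Callable[..., Any], params: Dict[str, Any], cache_dir: Path) -> Any:
    """Run step(**params) once per distinct params; reuse the saved result after that."""
    digest = hashlib.sha256(json.dumps(params, sort_keys=True).encode()).hexdigest()[:16]
    path = cache_dir / f"{step.__name__}-{digest}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    result = step(**params)
    path.write_bytes(pickle.dumps(result))
    return result


def train_model(month: str, seed: int) -> Dict[str, Any]:
    # Stand-in for the slow training run; the seed makes the randomness repeatable.
    rng = random.Random(seed)
    return {"month": month, "weights": [rng.random() for _ in range(3)]}


cache = Path(tempfile.mkdtemp())
april = cached_run(train_model, {"month": "april", "seed": 0}, cache)
april_again = cached_run(train_model, {"month": "april", "seed": 0}, cache)  # loaded from disk
may = cached_run(train_model, {"month": "may", "seed": 0}, cache)  # new params, so a new run
```

Note the limitation: the cache key here covers only the parameters, not the code or the underlying data, so editing `train_model` or changing the April data would silently reuse a stale result. That gap is exactly the code-versus-data versioning problem raised above.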

Between the long time delays (longer than people can hold a context in their mind, longer than they want to wait) for the system to settle down and the total complexity I find that many people involved with data science violently resist confronting the above issues. The effects are much like the visual "blind spot" -- you might get a series of projects that were 98% completed but didn't quite deliver business value although everybody feels like they did their part.

Like other vendors in this crowded space, dstack leads with technology as the key issue, e.g. "supports Python and R", "matplotlib, TensorFlow, plotly, ..."

It's certainly true that people don't want to face up to reality in that area. Maybe 50% or 90% of the "waste" in the area involves setting your dependencies up and begging your boss to get you access to "the cloud of your choice if that's what's needed". The trouble is that investment in particular technologies is of temporary value (maybe people will still be using R in 2030, maybe they won't be using TensorFlow, almost certainly plotly gets bought by Google and shut down by then).

Years back I researched the problem of running TensorFlow models that we got off the pavement: building a database that says TF version X depends on CUDA version Y and cuDNN version Z, and being able to have multiple copies of the userspace GPU drivers installed simultaneously (e.g. just put 'em in a directory and set the library path to point at 'em -- don't even need containers!)
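
A rough sketch of that approach in Python, assuming the deep-learning library in question is cuDNN and that each library version is unpacked into its own directory. The table entries, the `/opt/gpu-libs` layout, and the `env_for` helper are illustrative assumptions, not the original code; check any real table against TensorFlow's published tested build configurations.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, NamedTuple, Optional


class GpuLibs(NamedTuple):
    cuda: str
    cudnn: str


# Illustrative entries only; verify against TensorFlow's tested build configurations.
COMPAT: Dict[str, GpuLibs] = {
    "1.15": GpuLibs(cuda="10.0", cudnn="7.4"),
    "2.1": GpuLibs(cuda="10.1", cudnn="7.6"),
    "2.4": GpuLibs(cuda="11.0", cudnn="8.0"),
}


def env_for(tf_version: str, root: Path, base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return an environment whose LD_LIBRARY_PATH points at the matching libraries."""
    libs = COMPAT[tf_version]
    dirs = [
        str(root / f"cuda-{libs.cuda}" / "lib64"),    # hypothetical directory layout
        str(root / f"cudnn-{libs.cudnn}" / "lib64"),  # hypothetical directory layout
    ]
    env = dict(os.environ if base is None else base)
    existing = env.get("LD_LIBRARY_PATH")
    # Put the selected versions first so they win over system-wide copies.
    env["LD_LIBRARY_PATH"] = os.pathsep.join(dirs + ([existing] if existing else []))
    return env


env = env_for("2.1", Path("/opt/gpu-libs"), base={"LD_LIBRARY_PATH": "/usr/lib"})
print(env["LD_LIBRARY_PATH"])
```

The resulting dict can be passed as the `env` argument of `subprocess.run` to launch a model under the matching libraries. Keeping the table current is the ongoing cost.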

I could have sworn Google looked at my source because they did the one thing that could have broken that strategy. Also the company I was working for lost interest in that particular shiny thing. That's a basic problem with maintaining a distribution of other people's software -- like treading water it takes effort just to stay in one place.

The more fundamental problems that turn up in going from data to decision and products are eternal and not tied to a particular technology. If you solve those problems rather than chase the shiny you might break out of the pack.

I agree with your point. Reproducibility and versioning are an important yet very challenging topic right now, and not many tools seem to help with it. And it might be that the problem is not specifically about tools but rather about mindsets and workflows.

IMO dstack is a lot about process. Technologies can change. The process often stays. We’d like to find the best way to solve the problems people face every day regardless of a particular technology.

One more little thing which might be relevant is that dstack actually tracks revisions. What we haven't figured out yet is how to link a particular revision of the application with a particular revision of the code / notebook.

Nice initiative
