Show HN: Glue – Pandas as a DAG (Now a web app) (gluedata.io)
46 points by gthompson1 on Aug 23, 2022 | 28 comments



Hey guys,

I posted this maybe a year ago. It was originally an Electron-based desktop app, which was cool but pretty hard to maintain, and it was hard to get people to download and test it out. Given that there would likely always need to be a cloud component regardless, I decided that it would be easier to port it over into a web app and maintain it that way.

Since doing that I took a job and am pretty burned out on the project. I was thinking about open sourcing it and seeing if it gets more interest that way. Potentially as a self contained docker image that folks could pull down and run.

Glue is pretty similar to these commercial projects https://www.alteryx.com/, https://www.dataiku.com/, https://parabola.io/ and other DAG-based data pipeline generators. I also saw this posted on Hacker News within the last year or so https://datablocks.pro/ which is pretty cool too (also more of an indie project).
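For anyone unfamiliar with the "pandas as a DAG" idea, here is a minimal sketch of what such a pipeline could look like in plain Python. This is not Glue's actual implementation: the node names, the `PIPELINE` layout, and `run_pipeline` are all hypothetical, and the data is made up. It only assumes pandas and the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

import pandas as pd


# Hypothetical node functions: each returns a DataFrame and takes the
# outputs of its upstream nodes as positional arguments.
def load_orders():
    return pd.DataFrame({"customer": ["a", "b", "a"], "amount": [10, 20, 5]})


def load_customers():
    return pd.DataFrame({"customer": ["a", "b"], "region": ["east", "west"]})


def join(orders, customers):
    return orders.merge(customers, on="customer")


def total_by_region(joined):
    return joined.groupby("region", as_index=False)["amount"].sum()


# Hypothetical pipeline description: node name -> (function, upstream nodes).
PIPELINE = {
    "orders": (load_orders, []),
    "customers": (load_customers, []),
    "joined": (join, ["orders", "customers"]),
    "totals": (total_by_region, ["joined"]),
}


def run_pipeline(pipeline):
    """Run every node after its upstream nodes and return all outputs."""
    graph = {name: deps for name, (_, deps) in pipeline.items()}
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        func, deps = pipeline[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results


if __name__ == "__main__":
    print(run_pipeline(PIPELINE)["totals"])
```

A GUI tool would presumably build something like `PIPELINE` from the boxes and arrows drawn on the canvas, then run it in dependency order.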

Here is the current API. It probably needs some de-scoping so that it's only the important endpoints, but it's a look under the hood if anyone is curious: https://app.gluedata.io/docs#/

Anyway if anyone has any comments, suggestions / feedback please let me know. Happy to share any details or thoughts.


If you think you were burned out before, becoming an open source maintainer probably won't help :) If I were you I would put out a call for maintainers, and maybe consider a license that keeps modifications open source. That way a company that modifies it will have to contribute back to the project. If you do open source, I recommend moving your roadmap to a GitHub Project, put together a contributors guide/agreement.

Companies will definitely be averse to using a project due to trademark or licensing issues. Apparently somebody is using the name (https://alter.com/trademarks/glue-85698439). You could probably find a way to signify your use is distinctive from that other one, and you may want to register it to protect it. IANAL.

I think it has a lot of potential, but finding people who want to build it out/run the project will be the challenge.


"If you think you were burned out before, becoming an open source maintainer probably won't help" - Ha yeh I believe that.

"That way a company that modifies it will have to contribute back to the project. If you do open source, I recommend moving your roadmap to a GitHub Project, put together a contributors guide/agreement." - Thanks for the advice


Very cool. A GUI pandas. As a seasoned user of pandas, the way it works seems clear. I get the potential appeal of a no-code solution like this: "Glue is to pandas as Airtable is to relational databases." The question in my mind: how many users are there that want to build these kinds of data transformation pipelines that are also unwilling to learn Python/pandas (or R, or some other equivalent)? Because once you need to build a custom lambda, or do something downstream of the transformation, like a custom data visualization, you need to know how to write code. Which I think gets to the heart of the matter: this replaces the easy part of data science. The hard parts are things like understanding what the columns really are, cleaning up messy data, and doing statistical modeling that yields valid insights. I think open-sourcing it is a good idea. I can see how this could be added to a more complete software ecosystem that is designed to be used by non-coders, like an in-house ELN for a biotech. But as a paid project, I'm almost surely not going to ever give it a try.


Just a note: the first thing I did was check to see if the project was open-source, and that was before reading your comment indicating you were considering publishing the source. I did that because I'm not willing to get locked in to another walled garden.

That does not, of course, imply that publishing it with a FOSS license is the right move for you! I just thought it might be helpful to report my behavior.


Yeh I get it. I appreciate the comment. Part of the reason I am considering open sourcing it would be to give the project more life than I can currently give it myself.


Just wanted to say: nice job! :)

I can fully understand you burning out on something like this. How about declaring a time-out period where development is paused while you take a break? Maybe triage bug reports and feature requests in the meantime, but take some time to yourself to recover and reset your perspectives.

Just a thought.


Thanks! That's kind of you to say. Appreciate that.


This site makes my phone slow down to a crawl, to the point that my music app stopped. And it's just the landing page. This is insane. What is the resolution on those auto-playing videos?


Way too high. I haven't updated the landing page in a while. I need to. Sorry.


What are the expected inputs and outputs? How do you see this fitting into a user's workflow?


I'm not totally sure what the question is but currently these are the inputs:

delimited, excel, json, feather, parquet, sqlite, postgresql, mysql, googleSheets

Outputs are the same (excluding databases currently). This is probably the most time consuming piece of the project if I was to expand this set dramatically given the quantity of possible connections, although here is where open source could help.

It's hard to know how technical or non-technical to make this. For example, could there be a Python input, i.e. an arbitrary Python script to pull in data? That would allow for basically any input or output. But for non-technical users that's a harder sell... open to ideas.
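To make the "Python input" idea concrete, here is a rough sketch of how it might sit next to the built-in readers. It is only an illustration under assumptions: `READERS`, `read_input`, `run_python_input`, and the convention that the user script defines `load()` are all hypothetical and not part of Glue. The pandas readers themselves are real.

```python
import io

import pandas as pd

# Hypothetical registry mapping an input type to a pandas reader.
READERS = {
    "delimited": pd.read_csv,
    "excel": pd.read_excel,
    "json": pd.read_json,
    "feather": pd.read_feather,
    "parquet": pd.read_parquet,
}


def run_python_input(script):
    """Run a user script that must define load() returning a DataFrame."""
    namespace = {"pd": pd}
    # WARNING: exec runs arbitrary code. A hosted service would need real
    # sandboxing (separate process/container, resource limits) around this.
    exec(script, namespace)
    result = namespace["load"]()
    if not isinstance(result, pd.DataFrame):
        raise TypeError("load() must return a pandas DataFrame")
    return result


def read_input(kind, source=None, script=None):
    """Dispatch to a built-in reader, or to the user script for 'python'."""
    if kind == "python":
        return run_python_input(script)
    return READERS[kind](source)


if __name__ == "__main__":
    csv_df = read_input("delimited", io.StringIO("a,b\n1,2\n3,4\n"))
    custom_df = read_input(
        "python",
        script="def load():\n    return pd.DataFrame({'x': [1, 2, 3]})",
    )
    print(csv_df)
    print(custom_df)
```

The trade-off is the one described above: the script input covers basically any source, but it only helps users who can write the script, and it makes hosting much harder because of the sandboxing problem.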


Target data scientists and you'll have a very large segment of users. They aren't programmers but they deal with data a lot.


I think that is OP's intent. Data Scientists, Data Engineers, Data Analysts, Data X


Yeh I still appreciate the callout / opinion though. It's not obvious to me who is best to target. As a full stack engineer with analytics experience, for me the main uses of the tool are scheduling and running pipelines remotely, rather than the UI for data munging, which I can write pretty easily myself. But for non-technical users the data munging piece might be really helpful (e.g. no code). It's a little hard to serve both a marketer and a developer given the needs are so different. But then on the other hand I assume a marketer is more likely to pay for the tool (and thus keep it going) vs a dev who might be a harder sell. I built this tool with the hope of being able to serve both, which is where maybe I bit off more than I could chew.


I'd say you should still work toward whatever your own interest with the tool is. If it seems like too much work, make a list of all the functionality you want, figure out the easiest way to implement each, and work on the least-effort/highest-reward things first. And do whatever feels fun!


Would you be able to leverage singer taps / targets?

https://www.singer.io


Oh this is very interesting, I had never heard of this. Yeh I could potentially use this. I do think using a third party library or provider makes sense to massively expand the input / output options. Seems like there is a standard schema for the different tap configs that I could pull and wrap in UI forms in a generalized fashion.
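As a rough idea of what consuming a tap could involve: as I understand the Singer spec, a tap writes JSON messages to stdout, one per line, with a `type` of `SCHEMA`, `RECORD`, or `STATE`. The sketch below just groups `RECORD` messages by stream so they could be turned into tables. `collect_records` and the sample messages are hypothetical, and a real integration would also need to handle schemas, state, and launching the tap process.

```python
import json
from collections import defaultdict


def collect_records(lines):
    """Group the 'record' payload of Singer RECORD messages by stream name."""
    rows = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between messages
        message = json.loads(line)
        # SCHEMA and STATE messages are ignored in this sketch.
        if message.get("type") == "RECORD":
            rows[message["stream"]].append(message["record"])
    return dict(rows)


if __name__ == "__main__":
    # Made-up tap output, in the shape the spec describes.
    sample_output = [
        '{"type": "SCHEMA", "stream": "users", "schema": {"properties": {"id": {"type": "integer"}}}, "key_properties": ["id"]}',
        '{"type": "RECORD", "stream": "users", "record": {"id": 1, "name": "Ada"}}',
        '{"type": "RECORD", "stream": "users", "record": {"id": 2, "name": "Grace"}}',
        '{"type": "STATE", "value": {"users": 2}}',
    ]
    print(collect_records(sample_output))
```

Each stream's list of dicts could then be handed to something like `pandas.DataFrame(rows)` to become a regular input node.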

I don't know if you think these are fair comments about the state of the project? - https://www.youtube.com/watch?v=TBrSOPNEg-g&ab_channel=Resta...


I actually use Stitch so I don’t have to host orchestration, which is how I came to know about Singer in the first place.

It’s true that Stitch got acquired by Talend, but Singer seems independent enough. I don’t understand why Stitch would “abandon” Singer. They’re a sponsor, and seems like they have an incentive to keep that ball rolling. Their bread and butter seems to be hosting orchestration, so more taps means more customers for them. There’s other stakeholders in Singer, too, like Meltano, and all the self-hosters.

The GitHub activity seems robust to me, but I don’t really have any in depth knowledge about the Singer community. Saying it’s “dying” doesn’t seem accurate, though.


Repo for star purposes (no code) [1]

[1]: https://github.com/gjthompson1/glue-public


Did you use a particular library for the drag-and-drop graph building UI? Would like to know, as I could use something like that for a different project.


Hey! Yeh I did: https://g6.antv.vision/en, although I think I might try https://x6.antv.vision/en if I could start again. X6 wasn't as well built out when I started, but it's probably more suited to my use case with its interaction support. If you have other libraries that serve this purpose let me know, because I did a lot of searching and that's the best I found at the time.


Digging around a bit, I must say I am very impressed by all the data-viz stuff the antv group makes. Would really like to be able to try out their GraphInsight tool (https://github.com/antvis/GraphInsight/blob/master/README.en...) but it seems to be Chinese-only for the moment.


Thanks! Looks great, I hadn't heard of it


I guess it targets non-programmer users who also don't have math knowledge, so why use lambda and sigma symbols?


If you have other suggestions for the tool symbols let me know. With this project the right balance between technical and non technical is hard to hit.


Are you aware of AWS Glue?

(If nothing else I think its existence will be bad for your SEO/Stack Overflow Optimisation.)


Haha yeh... I have been told the branding / naming is bad for that reason which is fair. If you have ideas I'm all ears ;)

But yeh good points on SEO / stackoverflow.



