People don't have enough time to step back from their tools and think about what the ideal thing would look like.
I'm convinced this entire space should be visual. We always visualize data pipelines in our minds and on whiteboards...well, visually. Two-way code-diagram syncing is needed to get the best of both worlds. We should serialize to YAML from imperative code, but allow manipulating the diagram to modify the imperative code.
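One direction of that sync could be sketched like this: an imperative pipeline definition that emits a YAML-ish document a diagram editor could read and write back. Everything here (the `step` registry, the hand-rolled emitter) is hypothetical, just to make the idea concrete; a real tool would use a proper YAML library.

```python
# Hypothetical sketch: register pipeline steps imperatively, then
# serialize the structure to YAML so a diagram can render and edit it.

steps = []

def step(name, fn, deps=()):
    # Record the step's name and dependencies; the fn stays in code.
    steps.append({"name": name, "deps": list(deps)})
    return fn

step("extract", lambda: ..., deps=[])
step("clean",   lambda df: ..., deps=["extract"])
step("report",  lambda df: ..., deps=["clean"])

def to_yaml(steps):
    # Minimal hand-rolled emitter, enough for the round-trip idea.
    lines = ["steps:"]
    for s in steps:
        lines.append(f"  - name: {s['name']}")
        lines.append(f"    deps: [{', '.join(s['deps'])}]")
    return "\n".join(lines)

print(to_yaml(steps))
```

The reverse direction (diagram edits rewriting the imperative code) is the hard part, but the serialized form at least gives both sides a shared structure to agree on.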
Ideally what you want is also to track the dependencies of every atom of data in your org, and then have something cache and incrementally compute updates.
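The "cache and incrementally compute" part can be sketched as a tiny dependency graph where each node caches its value and only recomputes when something upstream changed. This is a toy illustration of the idea, not any particular tool's API.

```python
class Node:
    """One atom of data: a function over its dependencies, plus a cache."""
    def __init__(self, name, fn, deps=()):
        self.name, self.fn, self.deps = name, fn, list(deps)
        self.cache, self.dirty = None, True
        self.dependents = []
        for d in self.deps:
            d.dependents.append(self)

    def set(self, value):
        # Source nodes: a new value invalidates only what is downstream.
        self.cache, self.dirty = value, False
        self._invalidate_downstream()

    def _invalidate_downstream(self):
        for n in self.dependents:
            if not n.dirty:
                n.dirty = True
                n._invalidate_downstream()

    def value(self):
        # Recompute only when stale; otherwise serve the cache.
        if self.dirty:
            self.cache = self.fn(*(d.value() for d in self.deps))
            self.dirty = False
        return self.cache

orders = Node("orders", None)
total = Node("total", lambda xs: sum(xs), [orders])
orders.set([10, 20, 30])
print(total.value())  # -> 60
orders.set([10, 20, 30, 40])
print(total.value())  # only now recomputed -> 100
```

With the dependency graph explicit like this, "what needs to update when this source changes" is a graph traversal rather than a full re-run.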
The biggest thorn in this vision is SQL and the relational model, and I don't think a lot of people realize it.
It favors representing data such that the query planner can optimize query execution...rather than being able to track data dependencies and visualize data flow. It wasn't designed with today's complex ETL pipelines and many external data sources in mind.
As someone who is completely non-visual, I personally don't buy it. Trying to keep things visual will almost certainly choke on all of the extra information that comes with being visual.
I think it is easy to lob complaints at SQL. The problem is every successor seems worse. Teams I've been close to lately have taken the Python pill. Pandas, they push, is so much easier to understand. This works to an extent. But where we have dodged large SQL files, we now have giant Python files. Worse, they are trying to engineer things in "logical" ways for code sharing.
What I'm seeing is that they moved all of the complexity of most ETL engines into each Python file. This feels great at the start, but it is like how every todo-list project looks clean in basically any language. Ramp up the variety and size of data sources, and get ready for that clean idea to turn into the mess it thought it prevented.
For my two cents, if your ETL setup doesn't factor in Conway's law (who owns which parts of the data), it will fail in time. Trying to hide split ownership of data is a sure sign that changes will get misunderstood and data will be lost or misused.
> all of the extra information that comes with being visual
I think the key is to not make everything visual. You want to be able to view your system as you would draw it on a whiteboard. I believe it could be done using a restricted syntax of a proglang.
> Teams I've been close to lately have taken the python pill...
Most people have a desired result and want it as fast as possible.
When imperatively using map, groupBy, filter, you can usually get the immediate result you want very quickly. This is the appeal of Pandas, Python notebooks, etc.
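That appeal is easy to show. A few chained calls and you have the immediate answer; the toy DataFrame below is just for illustration.

```python
import pandas as pd

# The quick imperative path: filter, group, aggregate, done.
df = pd.DataFrame({
    "region": ["eu", "eu", "us", "us"],
    "amount": [10, 20, 30, 40],
})
result = (df[df["amount"] > 10]          # filter
            .groupby("region")["amount"]  # groupBy
            .sum())                       # aggregate
print(result.to_dict())  # -> {'eu': 20, 'us': 70}
```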
But when someone asks to re-compute on change, which pretty much applies to every task, their quick solution is no longer optimal.
And then if the schema changes, and you're doing all this intermediary stuff, refactoring becomes next to impossible.
Any SQL query could also be written in an imperative language, and it's probably easier to understand, tweak, and trace. A SQL relation does not maintain any info about its source, which makes incremental updates difficult, and refactoring too.
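As a concrete illustration: the same query once as SQL and once imperatively, where every intermediate has a name you can inspect and the output can be traced back to its source rows. The table and column names are made up for the example.

```python
# SQL: SELECT region, SUM(amount) FROM sales
#      WHERE amount > 10 GROUP BY region;

sales = [
    {"id": 1, "region": "eu", "amount": 10},
    {"id": 2, "region": "eu", "amount": 20},
    {"id": 3, "region": "us", "amount": 30},
]

filtered = [row for row in sales if row["amount"] > 10]   # WHERE
totals = {}
for row in filtered:                                      # GROUP BY + SUM
    totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]

# Lineage survives: each output key traces to concrete input rows.
lineage = {r: [row["id"] for row in filtered if row["region"] == r]
           for r in totals}
print(totals)   # -> {'eu': 20, 'us': 30}
print(lineage)  # -> {'eu': [2], 'us': [3]}
```

The SQL result set gives you `totals` but drops `lineage` on the floor; the imperative version keeps both for free.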
You can basically ask the question: for every piece of data you see in your company, can you easily see the "data lineage" from its source and all the transformations it was involved in, and how clean and efficient is this pipeline when receiving new data. Then a step further is, how easy is it to change something.
The problem with "as you would draw it on a whiteboard" is that much of what makes whiteboard drawings useful is how much you omit.
So I agree that being able to visualize things is nice. But don't confuse the visualization with the system itself, a conflation many visual tools push you toward.
I'm pretty much agreed on most of what you wrote, though.
And the very valid counterclaim from folks that do some of this is that there needs to be some mechanism to keep the visualization and the execution synchronized.
> I'm convinced this entire space should be visual.
At my last 2 jobs I spent entirely too much time debugging Matillion jobs, which are visual. I have my doubts that it’s the panacea that it appears to be.
Thanks for the link. Enso looks very cool. The traceability aspect is nice. I'd never want to use a visual interface to "code," though, like they show in their intro video. They have the two-way code-diagram sync, which is cool, and it's to an imperative language, but it's their own strange programming language. I just can't see myself coding Enso. If it were something familiar like Python or JS - then now we are talking!
My first job was building pipelines in Informatica's ETL tool and Pentaho. As far as I can remember, it used to be very visual back in the day. Then Hadoop and large-scale pipelines came into the picture, and all of a sudden these pipelines were super complex to manage, nobody knew what was happening, and we were riding the big data hype. With a scripting language, it is easy to test out these pipelines and showcase value before deploying them. Secondly, I think there is some sense of satisfaction or accomplishment when people script a pipeline rather than visually building one, even if the latter might be more efficient in some cases. So a visual representation is a sexier pitch, but it might be a hard sell.
>I'm convinced this entire space should be visual. We always visualize data pipelines in our minds and on whiteboards...well, visually. Two-way code-diagram syncing is needed to allow best of both worlds
I'm also convinced by this! Currently building something like this at https://www.ilumadata.io/
We have a pipeline builder (not yet on our website) that lets you visually build out the data pipeline. At the same time it's creating the corresponding files for each node which you can switch to.
No, that isn't correct. The query planner optimises query execution in a layer of abstraction below the representation of the tables, the relational integrity and structure of the user-accessible objects.
Tables, their attributes, relationships etc. are easily able to be represented visually. Entity-Relationship Diagrams have standard notation (Chen, Crow's Foot, UML). We can represent models conceptually, logically or physically; we can even indicate the existence of views and indexes. In terms of queries, DML is essentially multi-filtered, multi-dimensional slicing of tables and can be represented as output sets which are as modellable as the source tables.
On the wider point, DBT looks promising. It seems to be a good middle ground between the purists (like me) who wish back-end devs would just learn how to model data, and the adventurists who prefer to lock everything behind OOP principles.
The query planner can always change how it runs and is a black box. Nodes in the plan thus cannot be cached. This means that for streaming we are usually re-running the entire query, or doing some custom stuff.
When a source table changes, I want it to automatically and efficiently update anything that would change. I think pretty much every system would prefer real-time stuff like this.
If you are doing this with SQL you will start looking into "Incremental View Maintenance" which is quite complex and still quite heavy.
Then you realize that if you take control instead of handing it off to the planner, you could code your joins and transformations by hand, cache intermediary steps as needed, and have a query that is more efficient than SQL.
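A hypothetical sketch of what "taking control" buys you: hand-code the join, keep the build-side index as a cached intermediate, and join only the arriving row instead of re-running the whole query. All names here are invented for the example.

```python
# Hand-rolled incremental join: the build side is a cached index,
# and each new order is joined on arrival rather than via a full re-scan.

customers = {1: "alice", 2: "bob"}   # build side, keyed by id (cached)
joined = {}                          # cached, incrementally grown output

def on_new_order(order):
    # Incremental step: only this one row is joined.
    cid = order["customer_id"]
    joined.setdefault(cid, []).append(
        {"order": order["id"], "customer": customers[cid]})

on_new_order({"id": 100, "customer_id": 1})
on_new_order({"id": 101, "customer_id": 2})
on_new_order({"id": 102, "customer_id": 1})
print(joined[1])  # two of alice's orders joined, no table re-scan
```

A planner could choose this strategy too, but because the plan is a black box you can't rely on the intermediate existing, let alone cache it across runs.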
But for this I would argue you need a better way to visualize your data flow and the dependency graph, because people can easily write slow imperative stuff.