Show HN: Visualize the entropy of a codebase with a 3D force-directed graph (github.com/gabotechs)
180 points by gabimtme 10 months ago | 59 comments
Hi HN! I'm Gabriel, the author of dep-tree (https://github.com/gabotechs/dep-tree), and I wanted to show off this tool and explain why it's been really useful at my current org for dealing with code complexity.

I work at a startup where the business evolves really fast and requirements change frequently, so it's easy to end up with big piles of code stacked together without a clear structure, especially with tight deadlines. I made dep-tree to help us maintain a clean code architecture and a logical separation of concerns between parts of the application, which is accomplished by: (1) visualizing the source files and the dependencies between them using a 3D force-directed graph; and (2) enforcing dependency rules that allow/forbid dependencies between different parts of the application.

The 3D force-directed graph visualization works like this:

- It takes an entrypoint to the codebase, usually the main executable file or a library's entrypoint (index.js, main.py, etc...)

- It recursively crawls import statements, gathering other source files that are being depended upon

- It creates a directed graph out of that, where nodes are source files and edges are the dependencies between them

- It renders this graph in the browser using a 3D force-directed layout, where attraction/repulsion forces will be applied to each node depending on which other nodes it is connected to
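
Roughly, the crawling step amounts to something like this (a simplified Python sketch of the idea, not dep-tree's actual Go implementation; it ignores relative imports and package resolution subtleties, and only follows files that exist inside the repo):

    import ast
    from pathlib import Path

    def build_dep_graph(entrypoint: str) -> dict[Path, set[Path]]:
        # Recursively follow import statements starting from an entrypoint,
        # returning a file -> imported-files adjacency map.
        root = Path(entrypoint).resolve().parent
        graph: dict[Path, set[Path]] = {}
        stack = [Path(entrypoint).resolve()]
        while stack:
            path = stack.pop()
            if path in graph:
                continue  # already crawled
            graph[path] = set()
            for node in ast.walk(ast.parse(path.read_text())):
                if isinstance(node, ast.Import):
                    names = [alias.name for alias in node.names]
                elif isinstance(node, ast.ImportFrom) and node.module:
                    names = [node.module]
                else:
                    continue
                for name in names:
                    candidate = root / (name.replace(".", "/") + ".py")
                    if candidate.exists():  # keep only in-repo files
                        graph[path].add(candidate)
                        stack.append(candidate)
        return graph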

With this, properly decoupled codebases will tend to form clusters of nodes, representing logical parts that live together and are clearly separated from other parts, while tightly coupled codebases will render without clear clustering or any clear structural pattern in the node placement.

Some examples of this visualization for well-known codebases are:

TypeScript: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

React: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Svelte: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Langchain: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Numpy: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

Deno: https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...

The visualizations are cool, but they're just the first step. The dependency-rule-checking capabilities are what make the tool actually useful on a daily basis and what keeps us using it every day in our CI pipelines for enforcing decoupling. More info about this feature is available in the repo: https://github.com/gabotechs/dep-tree?tab=readme-ov-file#che.... The code is fully open source.
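
In spirit, the check boils down to flagging graph edges that match a forbidden pattern. Here's a simplified Python sketch of that idea (the actual rules are declared in dep-tree's config file; the globs and file names below are made up for illustration):

    import fnmatch

    # Hypothetical rules: source-file glob -> globs it must not depend on.
    DENY = {
        "src/ui/*": ["src/db/*"],           # the UI layer must not touch the DB layer
        "src/models/*": ["src/services/*"],
    }

    def check(graph: dict[str, set[str]]) -> list[str]:
        # Return a message for every dependency edge that violates a rule.
        violations = []
        for src, deps in graph.items():
            for pattern, forbidden in DENY.items():
                if not fnmatch.fnmatch(src, pattern):
                    continue
                for dep in deps:
                    if any(fnmatch.fnmatch(dep, f) for f in forbidden):
                        violations.append(f"{src} must not depend on {dep}")
        return violations

    if __name__ == "__main__":
        graph = {"src/ui/button.py": {"src/db/conn.py"}}  # toy example graph
        problems = check(graph)
        for p in problems:
            print(p)
        raise SystemExit(1 if problems else 0)

Failing the build whenever anything is flagged is essentially what keeps the rules honest over time.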




This is really cool. And as OP pointed out, I really like the pipeline integration: like when linting catches function-level complexity, but at the cross-file level. I prefer to think of programs in layers where the top layers can import lower layers, but never the other way around (and I'm also very cautious about horizontal imports). Something like this would help track that. Unfortunately, I'd really need it to support Go. I find it interesting that the code is written in Go but doesn't support Go. But I will watch this project.

From the visualization perspective, it reminds me a lot of Gource. Gource is a cool visualization showing contributions to a repo: you see individual contributors buzzing around, updating files on a per-commit and per-merge basis.

https://github.com/acaudwell/Gource


The visualization is actually inspired by Gource, but taken into 3D space; it's a really cool project.

Golang is very challenging to implement because dependencies between files inside a package are not explicitly declared: you can use any function from any file without importing it, as long as both files belong to the same package. Supporting Golang would probably require spawning an LSP and resolving symbols.

The reason for implementing dep-tree in Go was that things were going to get algorithmic af, and it was better to choose a language as simple as possible, knowing that it also needed to be performant.


If Go treats all files inside a package the same, maybe you should use packages as the "unit" in Go instead of files? That would probably still be useful, at least for bigger projects...


Yeah, that's an option. It's not a perfect fit with the philosophy of the project, but it's definitely possible. Ideally, though, it would just work between files in a package.


Especially since Go has a culture of backwards compatibility, and the API doesn't change if you move files around within a package.


Another 3D codebase tool is Primitive. They were pushing almost into IDE territory, but I'm not sure anything got beyond beta. Maybe with Apple Vision Pro they'll take another swing...

https://primitive.io


A tangentially related tool you can use to look at a repo over time is Git of Theseus[1]. It shows things like "what percentage of the code in this repo survives 6 months".

[1] https://erikbern.com/2016/12/05/the-half-life-of-code.html


That's really interesting!


This is cool, basically the first 3D codebase visualization I've seen that doesn't immediately give me a headache, so good job! :)

Always interesting to see different ways of visualising the same thing. A while ago my friend and I also made a codebase visualisation tool (https://www.codeatlas.dev/gallery), but instead of taking the graph route, we opted for Voronoi treemaps in 2D! It's a tradeoff between form and function for sure; modelling code as a DAG is definitely more powerful for static analysis. However, in most graph-based visualizations (this, Gource) I just find myself getting lost super quickly, because the shapes are just not very recognisable.

Really impressed by how polished this already is, nice docs, on-the-fly rendering, congrats!

If I ever find time to work on codebase visualisation again, I might have to steal the idea of codebase entropy to better lay out which files to place close to which others!


Ooops, I should take more care pasting links from markdown, this one works: https://codeatlas.dev/gallery


I've always felt like instead of public, private, and protected, there should be something like security groups and ACLs on classes and functions. That way it's very explicit when you are newly coupling things, and it brings tighter scrutiny to those changes.

Edit: oh, looking at the docs, apparently that's exactly what this tool does. Though it would be nice to have function-level granularity, maybe by annotating the code itself.
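
Something like this hypothetical decorator, say (a crude runtime-only sketch of the annotation idea, just to illustrate; a static check would obviously be nicer):

    import functools
    import inspect

    def visible_to(*allowed_prefixes):
        # Hypothetical ACL-style decorator: only modules whose __name__
        # starts with an allowed prefix may call the decorated function.
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                caller = inspect.stack()[1].frame.f_globals.get("__name__", "")
                if not any(caller.startswith(p) for p in allowed_prefixes):
                    raise PermissionError(
                        f"{caller} is not allowed to call {fn.__qualname__}"
                    )
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    # Only code under app.services may couple to this function.
    @visible_to("app.services")
    def charge_card(amount_cents: int) -> None:
        ...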


Build systems like Bazel provide mechanisms for controlling access at the module level. If you're disciplined about not just making everything "public", it can be really powerful. Bazel is a very big hammer, though, and might be overkill for your projects.


Oh, interesting! We already use Bazel at work, but it was all set up before anybody on the current team got here, and the only process we have is "fight it until it works", which fortunately doesn't come up too often. Or maybe _unfortunately_, as then there's not enough incentive to figure it out. Now you've given me inspiration to dig in and understand how it should really be used.


Is this just using the word 'entropy' as a stand-in for complexity, or is there some actual definition of entropy involved?


Nah, nothing like that; "entropy" here is meant in the colloquial sense of a level of disorder. It has proven to be a useful word for helping people understand what the tool is about, even though it's strictly incorrect.


This might be the first time someone has used the word "entropy" colloquially in the hopes that its meaning is generally well understood.

To me the usage makes no sense whatsoever; there's no entropy to be found here, and if there is, it's not what is being displayed. Perhaps it makes sense to people who don't know what entropy is, though most of them wouldn't know the word in the first place.


Same here. My first reaction to this was "how does this compute entropy at all?" Also, entropy and complexity are two different, if related, concepts, both mathematically and colloquially speaking.

It's an abuse of technical language in an effort to sound impressive, in my opinion, which I guess is a valid form of language evolution. There are other words in our lexicon that have become general-purpose and fuzzy in spite of their precise technical origins.


Great tool for visualizing complexity and dependencies. Entropy is the wrong term, though. For me (and likely many others) it conjures Shannon's seminal definition as an information measure. Entropy is a number.


Please don't do that, especially when presenting something that uses graphs, as entropy on graphs is an actual technical concept that's currently widely used in very hot fields.


Why not simply say "disorder" then and:

a. avoid making certain subsets of people think you're using precise concepts that you aren't, and

b. make it easier for people who don't know what entropy even is to understand what this tool does?

Disorder is a far more widely understood term, in my view.


Thanks for the feedback, y'all. I'll take this into account going forward.


It would be nice if C++ was supported. A lot of large legacy codebases written in C++ would be interesting to visualize.


Would it work to support Doxygen import, thereby getting several major languages at once?


Definitely; that and Java sound like two very good candidates.


Could it be that this can't check absolute imports? My Python project has many files which depend on each other, but they are not linked together in the generated graph. However, one of my modules has an __init__.py with relative imports, and this shows links between the files imported in the __init__.py.

Let's say my project looks like this:

src/example/foo.py

src/example/bar.py

And if bar.py contains the statement "from example.foo import Foo", there is no link between the files foo and bar. Though if the statement is "from .foo import Foo", it shows a link.


That's because dep-tree doesn't know it needs to resolve names starting from `src/`, as your imports have that piece of information trimmed. You can solve this by setting the PYTHONPATH env variable like this:

export PYTHONPATH=src


Perfect, that worked, thank you!

I thought this could be solved by changing the directory to src/ and then executing that command, but this didn't work.

This also seems to be an issue with the web app; e.g. the repository for the formatter Black is only one white dot: https://dep-tree-explorer.vercel.app/api?repo=https://github...


Yeah, the web app is quite limited; it doesn't accept any kind of configuration. Implementing the Python absolute path resolution mechanism was actually quite challenging, as there are just too many ways you can handle absolute imports.

I've seen people use tricks like `sys.path.extend(["src"])` in the main file to be able to place source code in an `src` folder, but unfortunately dep-tree is not able to take that into account.


It's cool, but half the battle. To keep an eye on decoupling you need to map where the state goes. For the web: which parts of the code are using fetch/XMLHttpRequest, the URL and its params, history, local storage, etc. It should be able to identify those browser APIs and draw them out like a dep link too. I just had to fix a component that was directly editing URL parameters instead of the store which updated the URL.


This is gonna sound weird as hell, but I would really dig implementing this on a doc repo with CCS (component content), where you re-use document modules[1]. Why do I care? Because some modules support way too much complexity, and entropy is a pretty good measurement of that.

[1] Asciidoc/RsT (include directive for both), XML (DITA/S1000D/DocBook/etc, each with different transclude mechanisms), any markup that supports transclusion.


I was recently working with a collection of Rust libraries with poor dependency management. Some dependencies wouldn't compile for certain platforms. In most cases these features were totally unnecessary for my usage.

Would love to see a tool that could automatically break these dependencies into optional features within their crate. It felt like a poor use of my time to track everything down manually.


Rich Hickey has a nice talk about this exact problem. He uses this scenario to explain why "classical" dependency management is flawed: you might only want one function of a library that has no dependencies itself, but you have to import the whole thing.

https://youtu.be/oyLBGkS5ICk?si=cawjnPnR9riEyvf2


Very pretty!

Out of interest, I'm wondering how this sort of method works if you ignore the semi-arbitrary distinction between your own code and other libraries. If, say, an array class is used everywhere, wouldn't that look like a bad pattern on the dependency graph? Or is there a way to read the graph that tells you that your pervasive use of np.array is still appropriately decoupled?


That's taken into account while rendering the graph. The attraction force between two nodes is inversely proportional to the number of edges a node has.

If a node is depended upon a lot, all the resulting edges induce weaker forces on adjacent nodes, so this accounts for the fact that some files will be depended upon a lot, and that's fine.
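
In code, the weighting idea looks roughly like this (an illustrative sketch, not the exact formula the renderer uses):

    # Edges touching high-degree nodes pull more weakly, so widely
    # depended-upon files (utils, shared types...) don't drag the
    # whole graph into a single blob.
    def edge_weight(degree: dict[str, int], a: str, b: str) -> float:
        return 1.0 / max(degree[a], degree[b], 1)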

There's also the option to just exclude those kinds of files from the analysis with the --exclude flag. I've found that to be useful for massive auto-generated files.


While excluding nodes on which a huge portion of the code depends is one solution to make the graph less messy, I think an interesting alternative would be to allow certain nodes to be duplicated. If the "energy" of the system could be reduced beyond some threshold by duplicating a node, duplicate it and connect the edges so as to minimize the "energy". Alternatively, let me configure that these nodes can be copied N times.
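
Roughly what I mean, as a sketch (with a plain degree threshold standing in for the energy computation; names and the threshold are made up):

    from collections import defaultdict
    from math import ceil

    def duplicate_hubs(nodes, edges, max_degree=20):
        # Replace any node whose degree exceeds max_degree with several
        # copies, distributing its edges round-robin among the copies.
        degree = defaultdict(int)
        for a, b in edges:
            degree[a] += 1
            degree[b] += 1

        copies = {
            n: [f"{n}#{i}" for i in range(ceil(degree[n] / max_degree))]
            for n in nodes
            if degree[n] > max_degree
        }
        new_nodes = [n for n in nodes if n not in copies]
        for dup in copies.values():
            new_nodes.extend(dup)

        counter = defaultdict(int)

        def pick(n):
            # Route each edge endpoint to the next copy in turn.
            if n not in copies:
                return n
            chosen = copies[n][counter[n] % len(copies[n])]
            counter[n] += 1
            return chosen

        new_edges = [(pick(a), pick(b)) for a, b in edges]
        return new_nodes, new_edges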


A friend of mine and his coworkers developed a toolchain to try to systematically improve code quality on a big Java project back in the day: https://xradar.sourceforge.net/. Some of the ideas might be useful for you. I think there is also a link somewhere to the paper they wrote.


Off topic but...

> I work at a startup where the business evolves really fast and requirements change frequently, so it's easy to end up with big piles of code stacked together without a clear structure, especially with tight deadlines

That smells.

It sounds like the team could benefit from better stack technologies and a bit more discipline in how it is applied to solutioning.

> Enforcing some dependency rules that allow/forbid dependencies between different parts of the application.

What is the alternative to this tool that lowers the cognitive barrier / builds the right muscles for the team to understand what they should / shouldn't depend on?


> It sounds like the team could benefit from better stack technologies and a bit more discipline in how it is applied to solutioning.

For our specific case it's actually pretty good; we've built a lot of discipline around maintainability. But in general this is a recurring problem in tech teams that might not be able to afford the time it takes to build that discipline.

> What is the alternative to this tool that lowers the cognitive barrier / builds the right muscles for the team to understand what they should / shouldn't depend on?

Some programming languages allow you to split the codebase into modular units (npm workspaces, cargo workspaces, etc.), which forces developers to modularize things, and dependencies between modules need to be explicitly declared.

This is good, but usually not enough, as nothing prevents you from messing things up within a module/workspace.

There's some other tooling with similar functionality to dep-tree, but it's language-specific and has visualizations not suitable for large codebases (.dot files, 2D SVGs...).


Indeed, and tools like dep-tree provide a combination of 1) making the module structure visible, 2) making rules about this structure concrete, and 3) automatically checking for rule violations.

These all help to lower the cognitive barrier to learning and maintaining the code base effectively. For developers new to the code base they help with learning, and for those more experienced they help with ongoing design and maintenance.

Most long-lived code bases I've seen have adopted or built such tooling at some point, often with tools customized to the code base. For example in one large code base (c. 250 devs) we built tooling that simulated and helped optimize the changes to implement a major refactor of the overall module structure.


Stack technologies tend to bound contexts based on technology rather than on domain boundaries.

This is why we see all these products targeted at companies with 24 microservices and 26 developers who have to run end-to-end testing on everything.

Architectural erosion is primarily a cultural issue and any tool that helps people discover and call out architectural violations is potentially useful.

Many companies can't just do the inverse Conway maneuver, and if you look at the State of DevOps report, note how they call out CAB forums and controls as being problematic for even high-performing companies trying to become elite.

Take this product as an example; it really just means you want to keep k8s but have given up on loose coupling and high cohesion.

https://www.signadot.com/blog/how-uber-and-doordash-enable-d...

Throwing products at structure problems typically doesn't work.


It's extremely common to get things twisted up. Even if there is a good tech lead, that person may not be good at writing documentation, may be too busy writing code, and may not yet have a plan for how to keep things organized.

Maintaining a code base requires communication, PR reviews and discipline. That doesn't always happen.

Having lint-check rules is brilliant. Never mind discipline; you just need a friendly error to say "don't import services into an ORM model file". I'm going to adopt this right away.


And even with discipline, sometimes introducing tech debt in order to ship something fast is actually desirable in the short term, especially in the startup world, so I don't think that anybody with deadlines is completely free from twisting things up.


There’s such a weird vane of do nothingness that runs through this comments attitude. Yeah of course it’s easy to pick dependancies when you don’t worry about deadlines. A programmer without a deadline is like a fisherman going to grocery store to buy fish and claiming it’s “best practices” better results, but what was the point?


I "think" I understand what I'm looking at - it's like a 3d dependency tree with added flow of exports -> imports? It certainly looks very pretty![1]

One piece of feedback, if I may. It's really difficult to read the blue labels against the black background. Is there any way to change the palette colors?

[1] https://dep-tree-explorer.vercel.app/api?repo=https%3A%2F%2F...


Well, that's one of the drawbacks of the smart color auto-generation... it's not that smart.

That definitely is an improvement point. I just calibrated things looking at my own screen, which might have a high saturation/brightness setting.

Thanks for the feedback!


This reminds me of doxygen diagrams - https://doxygen.nl/manual/diagrams.html


It's a similar idea, but I often find myself getting very lost in 2D drawings once the codebase reaches a certain size.


Pretty cool -- sadly I think this doesn't catch custom `imports` patterns in my package.json[0], so my graph is incomplete.

___

0. https://nodejs.org/api/packages.html#subpath-patterns


Yeah, unfortunately custom imports are only supported if declared in the tsconfig.json as path overrides, but that's definitely something that should be looked at.


This is really cool! We are currently developing a project with heavy C++ and a few Python scripts & wrappers, and we are planning a major refactor. Is it possible to adopt this with a C++ codebase?


Right now it only supports JavaScript, TypeScript, Python and Rust, but it's designed to be extended with any other language. Each language implementation is just a few hundred lines of code, so it's "easy" to add new ones; I think C/C++ and Java/Kotlin are good candidates that would be very easy to implement.


Just tried it with my C project. Entry point extension is not supported. :(


Love it, I think dependency trees are super underused data for static analysis.

The visualization here is amazing in its own right as well; can I ask what part of the codebase renders it and handles the force-directed part?


The portion of the code in charge of rendering lives inside `internal/entropy` (https://github.com/gabotechs/dep-tree/tree/main/internal/ent...).

Force-directed layout is an algorithm for displaying graphs in 2D or 3D space that simulates attraction/repulsion based on the dependencies between the nodes; the Wikipedia page explains it really well: https://en.wikipedia.org/wiki/Force-directed_graph_drawing
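
The core simulation loop is small enough to sketch (a toy 2D Python version under the usual spring-embedder assumptions; the real renderer works in 3D, adds degree-based weighting, and uses a proper integrator):

    import random

    def force_layout(nodes: list[str], edges: list[tuple[str, str]],
                     steps: int = 200, dt: float = 0.02):
        # Toy 2D force-directed layout: repulsion between all pairs of
        # nodes, spring-like attraction along dependency edges.
        pos = {n: [random.random(), random.random()] for n in nodes}
        for _ in range(steps):
            force = {n: [0.0, 0.0] for n in nodes}
            for i, a in enumerate(nodes):          # pairwise repulsion
                for b in nodes[i + 1:]:
                    dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
                    f = 0.001 / (dx * dx + dy * dy + 1e-9)
                    force[a][0] += f * dx; force[a][1] += f * dy
                    force[b][0] -= f * dx; force[b][1] -= f * dy
            for a, b in edges:                     # attraction along edges
                dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
                force[a][0] += dx; force[a][1] += dy
                force[b][0] -= dx; force[b][1] -= dy
            for n in nodes:                        # integrate one step
                pos[n][0] += dt * force[n][0]
                pos[n][1] += dt * force[n][1]
        return pos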

> Love it, I think dependency trees are super underused data for static analysis.

Definitely, especially for evaluating "the big picture" of a codebase.


I could use something like this for large Java projects.


Java is actually one of the top candidates to be implemented next.


Great tool!

React's graph looks like a mess. Why am I not surprised...


This is nice work on graph visualization, but we learned years ago that readable network visualization does not necessarily mean good software architecture. For example, a good drawing of a tree may be easy to read and even beautiful, but may reflect an underlying design with no re-use or modularity. A graph of relationships between functions in an abstract machine may look very complicated, but that doesn't mean the design is poor.

Graphs are wonderful abstractions for the structures that arise in many kinds of engineering, but you need to focus on understanding those abstractions, not just pictures rendered by heuristics. Visualization can be wonderful, but has its limitations, especially when used out of the box.


For sure there are always exceptions to the rule, but I've worked on many projects and I've yet to encounter one which could not be made modular and loosely coupled.

For some projects, you need to think really hard to design them correctly. The most extreme experience I've had of this was when I was working in the blockchain sector.

Initially, when I joined, the project was a tangled mess. Every module was connected to many other modules without clear separation of concerns and with tight coupling.

For the refactoring, we extracted the core cryptographic logic and separated it from the network/P2P logic, which we re-wrote from scratch. I designed it so that the P2P module would be fully data-agnostic, meaning that it would have no concept of what kind of data it would have to propagate through the network. It was a significant challenge to come up with such a design while also supporting features like peer banning, peer selection, peer shuffling, preventing messages from rebounding back to the sender, and preventing spam and duplicates. During the design phase, I was tempted many times to add some kind of business domain awareness to the P2P module but managed to resist until the project's completion.

The result was that the P2P module ended up with a very simple interface and was very versatile. Because it wasn't tied to any specific business domain, it could be used for a broad range of different blockchain consensus mechanisms and didn't require any code updates when business requirements changed. This was useful to us at the time since we had not settled on most details of our consensus mechanism. Also, it could be used for a wide range of other P2P use cases beyond blockchains; later, I was able to use that exact same module (without any changes) to build a DEX (decentralized exchange) with only about 4000 lines of additional custom code.

More interestingly, there was another blockchain project similar to the one I was working on, and they also decided to have a P2P module. But their module had awareness of business domain concepts such as 'transactions' and 'blocks', and their code was much messier, much longer and not reusable at all. They had to update it almost every time their business domain requirements changed, and it wasn't as reliable.

It was the exact same problem, but the second project gave in to the temptation of sharing business domain responsibilities across multiple modules (low cohesion) and this led to tight coupling.



