In case people are wondering, the algorithm is 'connected clustering'
> You can think of Connected Components in very layman's terms as sort of a hard clustering algorithm which finds clusters/islands in related/connected data. As a concrete example: Say you have data about roads joining any two cities in the world. And you need to find out all the continents in the world and which city they contain.
The article doesn't contain a concise pseudo-code description of the algorithm but instead sneaks up on it by throwing in lots of python that implements various graph operations.
> The article doesn't contain a concise pseudo-code description of the algorithm but instead sneaks up on it by throwing in lots of python that implements various graph operations.
True, that's really a lot of code while failing to outline the simple algorithm or concisely describe the basic idea. It doesn't mention complexity either, which is linear.
It might also help beginners to mention the image analogue, i.e. connected components of a regular grid graph, which is called 'flood fill'.
> In case people are wondering, the algorithm is 'connected clustering'
FWIW, I've always heard the algorithm called 'connected components', and a quick Google search on both terms gives me more relevant hits for 'connected components'. In fact, I don't get a single directly relevant hit that describes this algorithm on the first page for 'connected clustering'.
I find articles like this interesting. While I have hired plenty of software engineers and data scientists without Computer Science degrees I do think sometimes people without them are at a disadvantage.
Breadth First Search is a freshman year concept at many schools. Sophomore at the latest.
On the other hand it gives me some respect for people who have to learn this all themselves. Having 4 years to take off from work and just learn this stuff is a huge luxury and not necessary in a lot of cases. But at least for me I am very thankful for the structure of a formal degree. I don't think I could have self-taught all of this.
I'm a mechatronics engineer working as an embedded software engineer. I've had to rapidly learn and teach myself a whole lot to be able to do my job effectively and correctly.
Given that, I studied many of the standard CS algos when I was brushing up for a FAANG interview. I found this gem of a guide: https://bradfieldcs.com/algos/ and after working through the examples, I could understand the behavior and underlying mechanisms of these algorithms. which completely helps when reading about articles like this!
Anyway, I think one would only be at a disadvantage if they are not seeking to continuously learn & satiate one's curiosity.
This is very valid when the formal degree education is a proper one. Unlike here in my place, a small town in India - most of my computer science concepts where just read out aloud by someone - say a Prof and the formal degree was provided based on a theoretical test that was just out-of-Book answers. Even the stats I was taught during early days of my engineering, I have got no clue about when I self taught myself to do ML algorithms.
I guess it's also worth mentioning strongly connected components (SCCs) in this context (i.e. there is a directed path
from any node to any other node in a component).
In the context of data analytics a SCC finding algo can be used to identify bottlenecks of communication/transportation networks and fault tolerance of distributed networks. In this case we are particularly interested in edges that connect SCCs. Another application is clustering. For instance, finding clusters of accounts that all follow each other on Twitter. In this case each such cluster is a SCC.
Kosaraju–Sharir algorithm is an easy to understand SCC algorithm with linear running time that basically consists of two BFS passes.
P.S. NetworkX (https://networkx.github.io/) is a great Python library for working with graphs. It has implementations of many common graph algos including SCC finding algorithm (it is an implementation of Tarjan's algorithm).
I did this in a module called decision mathematics at A-level.
This is a nice implementation of classical graph problems.
Stuff like traveling salesman etc. Dijkstras algorithm.
The bit I am missing is how does this help me solve a day to day problem?
how can this be applied in something modern other than say a shortest path / route finding algorithm?
I suppose it's a way to reach an end point most efficiently and that makes it useful for nueral nets which are essentially tree's with weighted edge values.
the easiest path through the tree is something you could find and that might illuminate a route to optimizing your "thing" whatever that is.
It's actually not even a nice implementation of a classical graph algorithm. They use a Python list as a queue. Python lists are only efficient when used as stacks. calling list.pop(0) or list.remove(list[0]) takes time proportional to the length of the list. This makes the algorithm take quadratic time, when it should be linear.
For this to be a nice implementation, they should use the standard library deque for the BFS or just pop from the end of the list to do a more efficient DFS for the connected component algorithm.
I find useful just for visualizing real-life organizational workflows. Building a graph then let’s you update and layout stuff like specimen transfer across jurisdictions.
So it’s just a cleaner way of storing data relationships that allows for more questions to be asked like “show me all the labs with automated backups” or “show me how many nurses work weekends at more than one lab”
The title is absurdly reductive. Anyone interested in graphs should be encouraged to learn about shortest paths, max flow, spanning trees, etc. and to identify some of the major np-hard graph problems. "Addition - the one Arithmetic Algorithm you need to know".
If you find this interesting, you should check out Donald Knuth’s latest “The Art of Computer Programming, 4A: Combinatorial Algorithms, Part 1”. It goes into a lot of depth about the theory behind all of this. You don’t need to have read the first three for 4A to make sense.
Also known as union-find. Favorite implementation is Sedgewick's weighted quick union with path compression from the first chapter of his algorithms text. Also cool is an implementation in the graph chapter which actually makes use of DFS instead.
> You can think of Connected Components in very layman's terms as sort of a hard clustering algorithm which finds clusters/islands in related/connected data. As a concrete example: Say you have data about roads joining any two cities in the world. And you need to find out all the continents in the world and which city they contain.
The article doesn't contain a concise pseudo-code description of the algorithm but instead sneaks up on it by throwing in lots of python that implements various graph operations.