Breaking Code

April 8, 2013

A Python example on finding connected components in a graph

Filed under: Programming — Tags: , , — Mario Vilas @ 10:30 pm

Today I’ve been coding a solution for a problem we’ve encountered with @ggdaniel (cr0hn) during the development of GoLismero 2.0. It called for an implementation of an algorithm to find connected components in an undirected graph. You can find the source code at the bottom of this post.


A graph algorithm a day keeps the CS doctor away…

Suppose we have an undirected graph (connected by lines rather than arrows) in which we can find one or more “islands” of nodes that form connections to each other, but not to nodes in other “islands”. In graph theory, these “islands” are called connected components. In the image below, we see a graph with three connected components:

Example graph with three connected components. Image from Wikipedia.

Now, suppose we have a set containing all nodes, and we can visit each node to know what are its neighbors, that is, the other nodes it’s connected to. We want to find all the connected components and put their nodes into separate sets. How would we do that?

Luckily, according to the Internet (what would I do without it! get a “proper” job I guess) there’s a well known algorithm to solve this problem. :) and it would go more or less like this. Grab a random node from the graph, and add it to a new set. Now get all the neighbors of this node, discard the ones we already visited (so we don’t get stuck in an infinite loop), add them to the set, and keep doing the same thing from the beginning with each neighbor, recursively. When we’re done visiting neighbors, that means we finished finding one of the connected components – we can start over with a new random node we haven’t visited yet to find the next connected component, and so on until we visited all the nodes.

Confused? Let’s start over, this time with an example.

connected-components-1

In the above example we’d start with a set containing 5 nodes (A through E), where B is connected to A and D, and C is connected to E. So let’s pick a random node, say C. We paint C in blue and get the neighbors, in this case it’s just E. Now we visit E, so we paint it blue and get the neighbors, now it’s C again. Since we already visited C (we know because it’s already painted blue) we discard it. Now we have no more nodes to visit, so we’re finished getting the first connected component.

connected-components-2

But that was the easy one, and we’ve still got 3 more nodes to go. Let’s pick a random one from the ones we didn’t visit yet, that is, the nodes that remain gray (A, B and D), for example A. We paint it green and get the neighbors, which is just B. Now we visit B, paint it green and get the neighbors, A and D. But A was already visited (it’s green already) so we discard it, leaving only D. Finally we visit D, paint it green, and get its neighbor B. Since B was already visited we discard it. This leaves no more nodes to visit, and we’re finished.

connected-components-3


But what was the problem I was trying to solve?

If you came here only looking for the graph theory bit then you can safely skip this part ;) but if I can interest you with some tidbits on the internals of GoLismero, here’s a summary:

During the execution of the program, we have a given number of plugins all running concurrently, and a series of messages with data that are sent to each plugin for processing. Each plugin may also create new data objects and send them to the main process, who in turn re-sends it to the rest of the plugins. For example, the Web Spider plugin would crawl a website and create URL objects for each link it finds, and HTML objects for each page it downloads; then those objects are sent to other plugins that would analyze the URLs and the HTML pages looking for vulnerabilities, and creating Vulnerability objects describing each vulnerability they find. Another plugin may receive a vulnerability and exploit it -for example, an URL disclosure vulnerability- creating new URL objects, which in turn get sent to the Web Spider for crawling, and so on.

The data objects also contain references to each other — for example, if a data object contains an HTML page, it’ll also have a reference to another data object with the URL where it was found, and vice versa. If a vulnerability is found in that page, another data object will be generated with the description of the vulnerability, and it will also reference the URL where the vulnerability was found. The URL, in turn, will reference the vulnerability that was found on it.

More generally, this leaves us with an undirected graph of data objects, since any data object can reference any other data object, but references always go back and forth. Also, we store every data object in a database, where they can be consulted at any time by a plugin.

Now, this would be the problem: before we can send any data object to a running plugin, we have to make sure all other data objects it references are already stored in the database — otherwise, we might run into a race condition between the time the data object is sent to the plugins and the objects it references are stored in the database. If we fail to solve this, a plugin would occasionally find a data object references another object that can’t be found in the database.

To prevent this, a simple algorithm was needed to find clusters of objects that reference each other (a connected component). So as plugins send messages to the main process, the main process holds them until it can be sure it has a group of objects that aren’t referencing any data that hasn’t yet arrived. When the whole cluster is complete, it’s stored in the database and sent to the rest of the plugins.


Enough of this nonsense already, where’s the example code?!

Behold! Here it is, in all of its syntax-colored glory. (?)

Enjoy! :)

example-connected-components.py

    #!/usr/bin/env python
    
    # Finding connected components in a bidirectional graph.
    # By Mario Vilas (mvilas at gmail dot com)
    
    # The graph nodes.
    class Data(object):
        def __init__(self, name):
            self.__name  = name
            self.__links = set()
    
        @property
        def name(self):
            return self.__name
    
        @property
        def links(self):
            return set(self.__links)
    
        def add_link(self, other):
            self.__links.add(other)
            other.__links.add(self)
    
    # The function to look for connected components.
    def connected_components(nodes):
    
        # List of connected components found. The order is random.
        result = []
    
        # Make a copy of the set, so we can modify it.
        nodes = set(nodes)
    
        # Iterate while we still have nodes to process.
        while nodes:
    
            # Get a random node and remove it from the global set.
            n = nodes.pop()
    
            # This set will contain the next group of nodes connected to each other.
            group = {n}
    
            # Build a queue with this node in it.
            queue = [n]
    
            # Iterate the queue.
            # When it's empty, we finished visiting a group of connected nodes.
            while queue:
    
                # Consume the next item from the queue.
                n = queue.pop(0)
    
                # Fetch the neighbors.
                neighbors = n.links
    
                # Remove the neighbors we already visited.
                neighbors.difference_update(group)
    
                # Remove the remaining nodes from the global set.
                nodes.difference_update(neighbors)
    
                # Add them to the group of connected nodes.
                group.update(neighbors)
    
                # Add them to the queue, so we visit them in the next iterations.
                queue.extend(neighbors)
    
            # Add the group to the list of groups.
            result.append(group)
    
        # Return the list of groups.
        return result
    
    # The test code...
    if __name__ == "__main__":
    
        # The first group, let's make a tree.
        a = Data("a")
        b = Data("b")
        c = Data("c")
        d = Data("d")
        e = Data("e")
        f = Data("f")
        a.add_link(b)    #      a
        a.add_link(c)    #     / \
        b.add_link(d)    #    b   c
        c.add_link(e)    #   /   / \
        c.add_link(f)    #  d   e   f
    
        # The second group, let's leave a single, isolated node.
        g = Data("g")
    
        # The third group, let's make a cycle.
        h = Data("h")
        i = Data("i")
        j = Data("j")
        k = Data("k")
        h.add_link(i)    #    h----i
        i.add_link(j)    #    |    |
        j.add_link(k)    #    |    |
        k.add_link(h)    #    k----j
    
        # Put all the nodes together in one big set.
        nodes = {a, b, c, d, e, f, g, h, i, j, k}
    
        # Find all the connected components.
        number = 1
        for components in connected_components(nodes):
            names = sorted(node.name for node in components)
            names = ", ".join(names)
            print "Group #%i: %s" % (number, names)
            number += 1
    
        # You should now see the following output:
        # Group #1: a, b, c, d, e, f
        # Group #2: g
        # Group #3: h, i, j, k
About these ads

3 Comments »

  1. Link does not work.

    Comment by Henri Salo — May 1, 2013 @ 9:54 pm

  2. You’re right, it wasn’t working. I’ve just fixed it. Thanks!

    Comment by Mario Vilas — May 2, 2013 @ 2:42 pm

  3. For me is usefull to resolve parallel task execution and network availability/minimal performance problems, using critical path method (CPM) and PERT, according of the context, I translate the variables “employed”, “item”,”event”… by resources like bandwidth needs, computing cost, and the different kind of “time” by aproximations values of process complexity , medium time desired, or others variables which limit or condition the problem.

    The best thing of all, it is very easy to optimization expressing the graph as a matrix, and after applying, the critical path algorithm for obtain all possibles parallels tasks and their executions orders.

    PERT/CPM example, http://www.springer.com/cda/content/document/cda_downloaddocument/9783642251740-c1.pdf

    CPM http://en.wikipedia.org/wiki/Critical_path_method

    Comment by Borja (@b0rh) — June 20, 2013 @ 7:37 pm


RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

The Silver is the New Black Theme. Blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.

Join 2,481 other followers

%d bloggers like this: