One of Many Worlds: graph theory

Showing posts with label graph theory. Show all posts

2013-08-14

Ring Detection - 1

In this two-part series, we look at how Ojus Chemistry Toolkit (OCT) currently implements ring detection. In this part, detection of rings and ring systems is described.

Preparation

N.B. OCT's molecule reader converts any hydrogen atoms that it encounters into implicit hydrogens, i.e., the list of atoms and bonds in an OCT molecule never contains a hydrogen atom. Similarly, the number of hydrogens attached to an atom is determined looking at the atom's valence and any charge specified for it.

N.B. The Frerejacque value of the molecule is checked to see if there are any cycles at all, before the following procedure is employed.

When we are interested in finding cycles (and only cycles), we should not look at open branches — those that contain only leaves. Accordingly, we remove them. In reality, all of this computation happens on temporary data structures, not the actual molecule itself.

Removing Terminal Chains

Remember terminal atoms from this old article? Quickly, a terminal atom is one which has a single neighbour. We remove all such atoms. However, removing a terminal atom could make its sole neighbour a terminal atom itself! We have to remove it, too. But, then, removing it … ad infinitum. Here is the pseudocode.

    var removing := true.

    while (removing) {
        removing := false.

        for each atom `a` {
            if `a` is a terminal {
                remove `a`.
                removing := true.
            }
        }
    }

Note that removing an atom entails removing all the bonds in which it participates, etc. Thus, by the time the outer loop exits, all terminal chains will have been removed.

How Many Rings?

One Ring or More?

At this point, if all the atoms have exactly two neighbours, there is only one ring in the molecule. Why? More than one ring implies that a second ring is either fused to the first one, or is independent of it, but connected through a linking chain of atoms. In either case, there should be at least one atom that is a junction — it should have at least three bonds.

Cycle Detection

Case of One Ring

In the case of the molecule having only one ring, we mark that ring as the sole ring. We also create a single ring system with that sole ring. And, we are done!

Case of Multiple Rings

Detection of two or more rings present in the molecule is (rather) evidently more complex than that of a single ring. Over the last decade, or so, I have written four or five distinct algorithms to detect multiple rings in a molecule. All of them employed a recursive depth-first approach. The recursive approach made certain parts of the algorithm easier and clearer, while complicating certain others.

This time, though, I chose to employ a breadth-first approach. The outline is as follows.

    var candidates := new list of paths.
    var path := new list of atoms.

    var a := the first non-junction atom in the molecule.
    if no such atom exists {
        a := the first atom in the molecule.
    }
    path := `path` with `a` appended.
    candidates := `candidates` with `path` appended.

    while (`candidates` is not empty) {
        var test-path := fetch the first path in `candidates`.
        try-path `test-path`.
    }

As you can see, I have a queue into which I inject all candidate paths. A candidate path is potentially incomplete, and may need to be extended by adding a neighbour of the last atom in it, should one such exist. But that is part of processing the path.

Process a Test Path

    var current := last atom in `test-path`.

    for each neighbour `nbr` of `current` {
        if `nbr` was visited already in `test-path` {
            var ring := path from previous occurrence of `nbr` to `current`.
            if `ring` is valid {
                record `ring`.
            }
            continue loop.
        }

        var new-path := `test-path` with `nbr` appended.
        candidates := `candidates` with `new-path` appended.
    }

Thus, we add each path extension to the queue of candidates, at each atom that we encounter. Now, given a candidate ring, how do we validate it? First, the easy case: if the candidate has only three atoms, then it is certainly a genuine ring. Good! But, what if it has more than three?

Preliminary Validation

    if size of `test-path` is 3 {
        answer `true`.
    }

    for each atom `a` in `test-path` {
        if more than 2 neighbours of `a` exist in `test-path` {
            /* Must be a pair of rings fused by a single bond. */
            answer `false`.
        }
    }

    /* Must be a simple closed curve. */
    answer `true`.

Ring System Detection

Once we have recorded all rings, we sort them by size. The next task is to partition this set of rings into equivalence classes of ring systems. Two given rings belong to the same ring system if they have at least one bond or at least one atom in common.

Partitioning

The actual task of partitioning the set of rings is quite straightforward.

    var ring-systems := new list.

    outer:
    for each recorded ring `r` {
        for each ring system `rs` {
            if `r` and `rs` share a bond or
               `r` and `rs` share an atom {
                mark `r` as belonging to `rs`.
                rs := `rs` with `r` appended.
                continue `outer`.
            }
        }

        var new-rs := new ring system.
        new-rs := `new-rs` with `r` appended.
        ring-systems := `ring-systems` with `new-rs` appended.
    }

Conclusion

We have, now, identified the rings in the molecule, and have partitioned the set into ring systems. However, this set could contain potentially spurious rings. Eliminating the false candidates is a surprisingly more difficult procedure. We shall look into it in the next post!

2012-04-07

Graphs and molecules - 2

Note: This post utilises MathJax to display mathematical notation. It may take a second or two to load, interpret and render; please wait!

If you have not read the previous post in this series, please read it first here.

Ordering

The notion of ordering is very intuitive in the context of natural numbers. Indeed, when we learn natural numbers, their representation \(\bbox[1pt]{\{1, 2, 3, 4, \ldots\}}{}\) itself imprints an ordering relationship in our minds. Soon enough, we learn to assign a sense of relative magnitude to those numbers: 4 is larger than 2, etc. This concept extends naturally to negative numbers and rational numbers too.

A little rigour

Suppose that we represent the ordering relationship between two elements of a set using the symbol \(\le\). Then, we can define the properties that a set \(S\) should satisfy for it to be ordered.

Reflexivity: \(a \le a\forall a \in S\)
Antisymmetry: if \(a \le b\) and \(b \le a\), then \(a = b\ \forall a, b \in S\)
Transitivity: if \(a \le b\) and \(b \le c\), then \(a \le c\ \forall a, b, c \in S\)

We can readily see that integers and rational numbers satisfy the above properties. Accordingly, we say that integers and rational numbers are ordered, if we assign the meaning smaller than or equal to to the ordering relationship \(\le\).

In fact, we can see that integers and rational numbers also satisfy an additional property.

Totality: \(a \le b\) or \(b \le a\ \forall a, b \in S\)

A distinction

Totality is a stricter requirement than the preceding three. It mandates that an ordering relationship exist between any and every pair of elements of the set. While reflexivity is easy enough to comprehend, the next two specify the conditions that must hold if the elements concerned do obey an ordering relationship.

It is easy to think of sets that satisfy the former three properties, but without satisfying the last. As an example, let us consider the set \(X = \{1, 2, 3\}\). Now, let us construct a set of some of its subsets \(S = \{\{2\}, \{1, 2\}, \{2, 3\}, \{1, 2, 3\}\}\). Let us define the ordering relationship \(\le\) to mean subset of represented by \(\subseteq\). Exercise: verify that the first three properties hold in \(S\).

We see that \(\{1, 2\}\) and \(\{2, 3\}\) are elements of \(S\), but neither is a subset of the other.

Therefore, mathematicians distinguish sets satisfying only the first three from those satisfying all the four. The former are said to have partial ordering, and they are sometimes called posets or partially-ordered sets. The latter are said to have total ordering.

More ordering

Now, let us expand the discussion to include irrational numbers. Do our definitions apply? There is an immediate difficulty: irrational numbers have non-terminating decimal parts! How do we compare two such numbers? How should we define the ordering relationship? The integral part is trivial; it is the decimal part that presents the difficulty.

Sequence comparison

In order to be able to deal with irrational numbers, we have to introduce an additional notion — sequences. A sequence is a set (finite or infinite) where the relative positions of the elements matter. Another distinction is that elements can repeat, occurring at multiple places. The number of elements in a sequence, if it is finite, is called its length. Thus, sequences can be used to represent the decimal parts of irrational numbers.

Let \(X = \{x_1, x_2, x_3, \ldots\}\) and \(Y = \{y_1, y_2, y_3, \ldots\}\) be two sequences. We can define an ordering relationship between sequences as follows. We say \(X \le Y\) if one of the following holds.

\(X\) is finite with a length \(n\), and \(x_i = y_i\ \forall i \le n\) and \(y_{n+1}\) exists.
\(X\) and \(Y\) are infinite, and \(\exists\ n\) such that \(x_i = y_i\ \forall i \le n\), and \(x_{n+1} \le y_{n+1}\).

Armed with the above definition, we can readily see that we can compare two irrational numbers — in fact, any two sequences. Exercise: verify this claim by comparing two irrational numbers and two sequences of non-numerical elements!

Bottom and top elements

If a set \(S\) of sequences has an element \(b\) such that \(b \le s\ \forall s \in S\), the element \(b\) is called the bottom element of the set. The element t that we get if we replace \(\le\) with \(\ge\) is called the top element of the set. The bottom element and the top are unique in a given set.

In our first example above, \(\{2\}\) is the bottom element of the set, while \(\{1, 2, 3\}\) is the top. However, it is important to understand that bottom and top elements may not exist in a given set of sequences. Exercise: think of one such set.

Minimal and maximal elements

When a set does not have a bottom element, it is yet possible for it to have minimal elements. For an element \(m\) to be a minimal element of the set \(S\), \(s \le m \implies s = m\) should hold. If we replace \(\le\) with \(\ge\), we get maximal elements.

Minimal and maximal elements are difficult to establish (and, sometimes, even understand) in the context of infinite sets or complex ordering relationships. The same applies to bottom and top elements, too.

Conclusion

You may have begun wondering if the title of this post was set by mistake. On the contrary, these concepts are very important to understand before we tackle canonical representation of molecules, ring systems in molecules, etc., which we shall encounter in future posts.

2011-07-28

Performance improvement - 3

Hmm … a new addition to the ring detection algorithm set brings back the distance matrix into the game. Over the last week, I was interacting with potential users, trying to assess the accuracy of the algorithm. They wanted an SSSR (Smallest Set of Smallest Rings) result set for each input molecule, not a more exhaustive set that I was deeming necessary.

Accordingly, I implemented a new SSSR algorithm. This relies on eliminating contours which can be short-circuited. That, in turn, requires shortest paths between all atoms - pair-wise - to be determined first. So, the matrix is back!

Consequently, the running time for my test set of 19946 molecules is now in the range of 16.5-17s. Preliminary profiling shows that building this matrix for each molecule takes almost 60% of the running time. I should investigate better spanning tree algorithms!

2011-07-01

Performance improvement - 2

No sooner had I finished my previous post than I began reviewing the algorithmic aspects of ring discovery in the program. I observed two points standing out.

During the analysis of the molecule's graph, we build a matrix of pair-wise distances between all atoms of the molecule.
During ring discovery, we remove all terminal chains (remember terminal nodes from here?), and then explore the remaining graph to detect rings.

As I reviewed the code, I saw that I was recursively walking the graph in both cases. When building the distance matrix, the logic is simple: directly bonded neighbors are at a distance of 1; recurse over the neighbors of each neighbor, and increment the distance by 1 at each level. Whenever we encounter a shorter path between the same pair, we record this shorter distance. Exercise: Why do multiple paths exist at all? A similar algorithm is followed for ring discovery, but walking only the subgraph comprising non-terminal atoms.

Whenever a ring is detected, it is necessary to determine if it is a Jordan curve (simple closed path) or not. We are interested in Jordan curves; we mark others as fused systems. How do we, then, determine if the current path is a Jordan curve? The parent molecule of every component has a map that holds information about bonds. The key of the map is a hash; the value, the bond itself. Suppose that the bond is between the atoms a₁ and a₂. The hash is computed as: 10000 * a_L.id + a_G.id, where a_L is the atom with the lower ID of the two, and a_G is that with the higher. This map is constructed when parsing the input, and is, consequently, ready by the time ring discovery is requested. Clearly, it is adequate to determine if a given path is a Jordan curve or not. How? :-)

I suddenly recollected that in the first version of the ring discovery algorithm, the distance matrix was used to test if a path was a Jordan curve. In that version, it was necessary to pre-calculate the distance matrix. With it being no longer mandatory, I removed the call to build it during ring discovery.

Lo and behold — the running time dropped from 18.5-19s to 6.5-7s! This time, the improvement in performance had nothing to do with Go, and everything to do with my own oversight. Needless to say, I am happier now! :-)

2011-05-31

Graphs and molecules - 1

Background

Mathematics - referred to as the queen of the sciences by that great Karl Gauss - comprises a large number of fields. It provides abstractions of varying sophistication, together with a vast body of more specific results. Several of them have applications in areas far beyond their origins.

One such that is (albeit indirectly) relevant to our current context is topology. Topology can be defined as the study of properties of objects that are invariant under continuous maps. Topology has evolved into a large field, whose major components include point-set topology, algebraic topology and geometric topology. Of these, point-set topology lays the foundation, defining most of the basic concepts. People who take a course in mathematics at the college level are familiar with several of them, usually in specific forms as applicable to real numbers, etc. In generalized forms, concepts such as continuity remain simple, while others such as compactness (every open cover should have a finite subcover) can be surprisingly subtle!

Curves

Let I be an interval [x, y] of real numbers, with x ≠ y. Let f:I → R². Further, let f be a continuous map. We can call the range of f a curve. It is common to call f itself the curve.

Now, suppose that the restriction of f to [x, y) is injective (i.e., f(a) = f(b) ⇒ a = b ∀ a, b ∈ [x, y)). The curve does not intersect itself, since the map is injective except for the single point y. Further suppose that f(x) = f(y). Such a curve is called a simple closed curve or a Jordan Curve.

Some texts define I above to be the interval [0, 1]. Another commonly found definition is: a Jordan Curve in R² is one for which it is possible to construct an injective continuous map g:S¹ → R², where S¹ is a circle. All the definitions are equivalent. Exercise: how? :-)

Graphs

Evidently, when we move to a discrete scenario, concepts such as continuity have to be adapted appropriately. Also, the discrete scenario brings two distinct kinds of entities into the discussion: nodes (or vertices) and edges. Note that the same notions exist in geometry as well. For instance, a triangle has three nodes and three edges. A graph is a collection of nodes and edges, where each node has at least one edge (i.e., one adjacent node).

A path can be defined as the sequence v_i of vertices, where an edge e_i exists between the nodes v_i and v_i+1. The index i varies in the discrete interval [1, n], for an applicable value of n.

Analogous to the continuous scenario, we can define a simple closed path or a simple cycle as a path with no repeated nodes except for the starting/ending node.

For our immediate purposes, we are mostly interested in undirected graphs.

Connectedness, terminal nodes

Let us denote the nodes of a graph by N, and its edges by E. For x, y ∈ N, x ≠ y, if there exists a path from x to y, we say that x and y are connected. If the above holds good ∀ x, y ∈ N, then we say that the graph itself is connected.

Suppose that the graph is not connected. Then it contains two or more connected subgraphs. Obviously, they are pair-wise disjoint, i.e., any given node belongs to one and exactly one such subgraph. Similarly, any given edge belongs to one and exactly one such subgraph.

A node with only one adjacent node connected to it (even though it can have any number of edges to that single adjacent node) is called a terminal node. As a special case, we note that a node that is part of a simple closed path cannot be a terminal node. Exercise: why? :-)