2011-05-31

Graphs and molecules - 1

Background


Mathematics - referred to as the queen of the sciences by that great Karl Gauss - comprises a large number of fields. It provides abstractions of varying sophistication, together with a vast body of more specific results. Several of them have applications in areas far beyond their origins.

One such that is (albeit indirectly) relevant to our current context is topology. Topology can be defined as the study of properties of objects that are invariant under continuous maps. Topology has evolved into a large field, whose major components include point-set topology, algebraic topology and geometric topology. Of these, point-set topology lays the foundation, defining most of the basic concepts. People who take a course in mathematics at the college level are familiar with several of them, usually in specific forms as applicable to real numbers, etc. In generalized forms, concepts such as continuity remain simple, while others such as compactness (every open cover should have a finite subcover) can be surprisingly subtle!

Curves


Let I be an interval [x, y] of real numbers, with xy. Let f:IR2. Further, let f be a continuous map. We can call the range of f a curve. It is common to call f itself the curve.

Now, suppose that the restriction of f to [x, y) is injective (i.e., f(a) = f(b)a = b ∀ a, b ∈ [x, y)). The curve does not intersect itself, since the map is injective except for the single point y. Further suppose that f(x) = f(y). Such a curve is called a simple closed curve or a Jordan Curve.

Some texts define I above to be the interval [0, 1]. Another commonly found definition is: a Jordan Curve in R2 is one for which it is possible to construct an injective continuous map g:S1R2, where S1 is a circle. All the definitions are equivalent. Exercise: how? :-)

Graphs


Evidently, when we move to a discrete scenario, concepts such as continuity have to be adapted appropriately. Also, the discrete scenario brings two distinct kinds of entities into the discussion: nodes (or vertices) and edges. Note that the same notions exist in geometry as well. For instance, a triangle has three nodes and three edges. A graph is a collection of nodes and edges, where each node has at least one edge (i.e., one adjacent node).

A path can be defined as the sequence vi of vertices, where an edge ei exists between the nodes vi and vi+1. The index i varies in the discrete interval [1, n], for an applicable value of n.

Analogous to the continuous scenario, we can define a simple closed path or a simple cycle as a path with no repeated nodes except for the starting/ending node.

For our immediate purposes, we are mostly interested in undirected graphs.

Connectedness, terminal nodes


Let us denote the nodes of a graph by N, and its edges by E. For x, y ∈ N, xy, if there exists a path from x to y, we say that x and y are connected. If the above holds good ∀ x, yN, then we say that the graph itself is connected.

Suppose that the graph is not connected. Then it contains two or more connected subgraphs. Obviously, they are pair-wise disjoint, i.e., any given node belongs to one and exactly one such subgraph. Similarly, any given edge belongs to one and exactly one such subgraph.

A node with only one adjacent node connected to it (even though it can have any number of edges to that single adjacent node) is called a terminal node. As a special case, we note that a node that is part of a simple closed path cannot be a terminal node. Exercise: why? :-)

2011-05-24

Simple task distribution to worker goroutines -- example

The utility program stsearchsdf that I wrote about previously has a simple supervisor-workers setup for processing the input molecules in parallel. To begin with I have an iterator that reads the input SDF file, and returns the text of the next molecule, on each call to Next(). The input file is represented by a *bufio.Reader as shown here.

// SDFile represents an on-disk file in MDL's SDF format.
type SDFile struct {
    f *os.File
}

// Init initializes an SDFile with the underlying file.
func (s *SDFile) Init(fn string, flag int, mode uint32) os.Error {
    f, e := os.OpenFile(fn, flag, mode)
    if e != nil {
        return e
    }

    s.f = f
    return nil
}

// Close closes the underlying file.
func (s *SDFile) Close() {
    s.f.Close()
}

// SdfIter is an iterator over SDFile.
type SdfIter struct {
    r *bufio.Reader
}

Once we open the input SDF file, we can request for an iterator as follows.

// Iter creates and returns an iterator over the current SDF file's
// molecules.
func (s *SDFile) Iter() *SdfIter {
    it := new(SdfIter)
    it.r, _ = bufio.NewReaderSize(s.f, BUF_SIZE)
    return it
}

This iterator utilizes the sequence of characters that separates consecutive molecules (the sequence is $$$$) to read the text corresponding to one molecule. It answers that text in the success scenario.

// Next reads the text of one molecule from the input buffered stream
// into a slice, and returns the same.
func (it *SdfIter) Next() (v string, b bool, err os.Error) {
    sl := make([]string, 0, SLICE_SIZE)

    for {
        s, e := it.r.ReadSlice(REC_SEP_H)
        if e != nil {
            err = io.ErrUnexpectedEOF
            return
        }
        sl = append(sl, string(s))
        s, e = it.r.ReadSlice('\n')
        if e != nil {
            err = io.ErrUnexpectedEOF
            return
        }
        sl = append(sl, string(s))

        if REC_SEP_T == strings.TrimRightFunc(string(s), IsWspace) {
            b = true
            break
        }
    }
    v = strings.Join(sl, "")

    return
}

This iterator is then made use of by the supervisor to read the molecules one at a time. Meanwhile, the supervisor creates the necessary communication channels. Since the nature of the actual task varies based on the user's request, the function that represents the task itself (simply denoted fn here) is injected into the supervisor loop function. opts.mx determines the number of goroutines to spawn, and is taken from the command line.

    tchs := make([]chan TMsg, opts.mx)
    rch := make(chan RMsg, opts.mx)
    for j := 0; j < int(opts.mx); j++ {
        tch := make(chan TMsg)
        tchs[j] = tch
        go fn(j, tch, rch)
    }

The supervisor keeps track of whether the entire input file is processed, whether the search result has been found, and the number of live worker goroutines. When there are no more worker goroutines, the supervisor loop itself can exit.

    // Loop over the file.
    found := false
    eof := false
    liveHelpers := opts.mx
L1: for {
        if 0 == liveHelpers {
            break L1
        }

When the result has been found or the input file is exhausted or there was an error, all workers are sent termination messages, which each of them acknowledges. The acknowledgement is shown later.

        select {
        case rmsg := <-rch:
            if (-1 == rmsg.mn) && (!rmsg.b) {
                liveHelpers--
                continue L1
            }

            if !found && rmsg.b {
                fmt.Println("-- Molecule number :", rmsg.mn, "\n")
                fmt.Println(rmsg.v)
                found = true
            }

            if found || eof {
                tchs[rmsg.id] <- TMsg{-1, nil, nil}
                continue L1
            }

We are now ready for the task distribution itself. A defined number of molecules serves as the quantum task size. This grouping helps in reducing the number of messages sent back-and-forth over the channels.

            var tmsg TMsg
            tmsg.mn = i
            tmsg.opts = opts
            tmsg.v = make([]string, 0, TASK_SIZE)
            k := 0
L2:         for k = 0; k < TASK_SIZE; k++ {
                v, b, _ := it.Next()
                if !b {
                    break L2 // TODO: This may need better handling.
                }
                tmsg.v = append(tmsg.v, v)
                i++
                if i >= opts.to {
                    break L2
                }
            }

            if k > 0 {
                tchs[rmsg.id] <- tmsg
            } else {
                tchs[rmsg.id] <- TMsg{-1, nil, nil}
                continue L1
            }
        }
    }

Once we exit the above loop, we can close the receiving channel since all the workers will have exited by then.

Now we take a look at one of the task functions that is run in a worker goroutine. As soon as a worker is created, it sends an idle message to the supervisor.

func searchaHelper(id int, tch chan TMsg, rch chan RMsg) {
    rch <- RMsg{id, 0, "", false, nil}

In response, the supervisor starts assigning work to it. The task handling loop is quite simple: it exits when it receives a termination request (which it first acknowledges); else it processes input data.

    for {
        tmsg := <-tch
        if -1 == tmsg.mn {
            rch <- RMsg{id, -1, "", false, nil}
            close(tch)
            break
        }

        var rmsg RMsg
        rmsg.id = id
        rmsg.b = false
        for k := 0; k < len(tmsg.v); k++ {
            m := mdl.NewMolecule()
            rmsg.e = m.ParseMol(strings.Split(tmsg.v[k], "\n", -1), skip)
            if nil == rmsg.e {
                if m.HasAtoms(tmsg.opts.acount, tmsg.opts.comp) {
                    rmsg.mn = tmsg.mn + k + 1
                    rmsg.v = tmsg.v[k]
                    rmsg.b = true
                    break
                }
            }
        }

        rch <- rmsg
    }

Finally, depending on the actual outcome of the search, we show an appropriate message to the user.

The above is conceptually simple, and easy to understand. For production, we have to introduce the requisite error handling code, of course. But, Go makes it really easy to think about the task distribution structure at a reasonably high level. What is more, it allows us to implement such a structure in a straight-forward manner, too!

2011-05-23

Large files and multiple cores - 2

In November 2010, I wrote a small utility program - in Go - to view and search for molecules in an MDL SDF format file. I wrote about it here. As the program turned out to be quite useful, I cleaned it up, and made it more streamlined and modular. Here is its current version of the function that prints help text.
func printHelp() {
    str := `
NAME
    stsearchsdf - view, extract from and search an SDF file

SYNOPSIS - GENERAL FORM
    stsearchsdf command [params]

SYNOPSIS - SPECIFIC FORMS
    stsearchsdf help

    stsearchsdf show in=filename [from=m] [to=n]

    stsearchsdf copy in=filename out=filename [from=m] [to=n]

    stsearchsdf searcha in=filename (symbol=count)... [comp=(eq|gt|lt)]
    [from=m] [to=n] [mx=c]

    stsearchsdf searcht in=filename (tag=tagname tagval=tagvalue)... [from=m]
    [to=n] [mx=c]

    stsearchsdf counta in=filename (symbol=count)... [comp=(eq|gt|lt)]
    [from=m] [to=n] [mx=c]

    stsearchsdf countt in=filename (tag=tagname tagval=tagvalue)... [from=m]
    [to=n] [mx=c]

DESCRIPTION
    stsearchsdf is a program that can be used to view or extract parts of an
    SDF file.  It can also be used to search an SDF file for molecules that
    contain atoms of given elements in specified number, or those with
    specified tag values.

COMMANDS
    help
        Display this help text, and exit.

    show
        Display the sequence of molecules in the specified serial number
        range.

    copy
        Copy the sequence of molecules in the specified serial number range
        to the given output file.

    searcha
        Search for and display the first matching molecule containing specified
        numbers of given atom types.

    searcht
        Search for and display the first matching molecule containing specified
        tag values.

    counta
        Display the number of matching molecules containing specified numbers
        of given atom types.

    countt
        Display the number of matching molecules containing specified tag
        values.

OPTIONS
    in
        Input SDF file name.

    out
        Output SDF file name.  Any existing output file will be overwritten.
        Applicable to only the command 'copy'.

    from
        Number of the molecule where processing should start.  Defaults to 1.

    to
        Number of the molecule where processing should stop.  Defaults to the
        last molecule in the file.

    symbol
        Atomic symbol.  This should be given in proper case, e.g. C, Na.

    comp
        The numeric comparison to be used.  Use 'eq' for equality check, 'gt'
        for greater than, and 'lt' for less than.  Defaults to 'eq'.

    tag
        Name of the tag whose value follows.

    tagval
        Value of the tag field whose name this follows.

    mx
        Maximum number of threads to use for processing.  Defaults to 2.
        Applicable to only search and count commands.
`
    fmt.Println(str)
}

2011-05-22

Legacy code and multi-threading

After a hiatus of a few months, here is an update. What follows is an e-mail that I sent to my collaborators regarding the future technological direction of our organic chemistry product.
* * * *
Apart from resuming my work on importing available chemicals databases, etc., I have spent some time over the last few weeks studying recent developments in making programs multi-threaded, etc. I have written several multi-threaded programs in the past. Usually, the following are the important problems that arise in the context of multi-threaded programs:
  1. identifying the variables (mutable state of data) that need to be accessed from more than one thread,
  2. determining the locking mechanisms needed to protect simultaneous access to such variables from multiple threads, and
  3. using appropriate synchronization mechanisms for proper control flow and cause-effect relationships.
In contrast to the model described above, an alternative approach was proposed by Hewitt, Bishop and Steiger in 1973. It is called the Actor model. It comprises independent objects (all objects are actors) communicating via messages. No other form of interaction exists between independent objects. This communication can be synchronous or asynchronous, but is usually asynchronous by default. Also, actors can create new actors. Actors can also maintain state, designating what their behavior should be in response to future incoming messages.
Yet another alternative was proposed in 1978 by Hoare, called Communicating Sequential Processes (CSP). While being similar in spirit to the Actor model (all objects are communicating processes), CSP differs in a few aspects. For instance, in the Actor model, the sender of a message needs to know the address of the receiver in order to send a message. In CSP, communication happens through channels. A sender need not be aware of the identity of the receiver at the other end of the channel. Also, in CSP, message-passing is in itself a synchronization mechanism: a sender cannot complete the process of sending a message unless the receiver is ready to accept one. In the Actor model, on the other hand, the `system' itself holds the messages that receivers are not yet ready to accept.
Please refer to http://en.wikipedia.org/wiki/Actor_model and http://en.wikipedia.org/wiki/Communicating_sequential_processes for introductory details.
It is pertinent to note that neither the Actor model nor CSP mandates conventional locks or globally-shared data! It is known in the industry that locking and lock-based synchronization are difficult to implement correctly, and even more difficult to scale up to multi-processor scenarios.
About 2-3 years ago, when I used Scala (http://www.scala-lang.org) for a few programs, I utilized the Actor model (provided in Scala through a standard library) to develop a framework for lock-less multi-threading that could utilize all the available processors. That was, of course, before I had heard of Akka. I employed it successfully for processing large amounts of data from the chemical databases of one of my clients.
Some time last year, I became aware of, and interested in, the Go (http://golang.org) language. These days, I use it for most new programs that I develop. Go implements CSP at the language level itself. I have found it to be very useful and convenient for developing multi-threaded programs that operate in a lock-less manner. I have recently written a utility program, using Go, to view and search for molecules in MDL's SDF file format.
Due to C and C++ not having an inherent notion of threading or message passing, there exists no straight-forward way of making ST's main program multi-threaded. We have to employ the Boost library or the MPI(Message-Passing Interface) library or (worse) POSIX threads directly or something similar, in order to wrap certain parts of the current main program. I remember that you looked at MPI (OpenMPI, in particular) several months ago
In my (rather limited) experience, both Boost libraries and MPI present steep learning gradients. They are large, their APIs are large, they are inherently complex, and the very amount of code they require us to write is significantly high. I am, in fact, more concerned about the effort needed to debug and maintain such code in the long run. In contrast, I have found Scala's Actor model and Go's CSP model to be welcome reliefs by way of their simplicity and ease of use.
Scala is a language implemented for the Java Virtual Machine (JVM). This, unfortunately, means that should we choose to employ Scala, our clients have to install Java in their Linux computers. Installation of Java is a decision outside our control, and is made by their respective IT departments.
Go, on the other hand, is similar to C and C++: it compiles to native code, and does not require a virtual machine. Consequently, we can give a new main program to our clients without asking them to install a JVM first. More importantly, we can easily call C code from within Go, making it feasible to re-use select parts of the current main program in a more flexible manner.
In view of the above, after several weeks of study and trials, I think that we should employ Go for making ST's main program multi-threaded. This applies to all areas of the main program that lend themselves to multi-threading, in general, and to the convergent synthesis problem, in particular.
I have written a small part of the new SDF import program in Go. Apart from a particular annoyance that I encountered (because of the absence of generics in Go), I should say that I was delighted to see how easily the program could be written multi-threaded to utilize both the processors in my computer. The effective increase in speed was about 50%, and I think that it can increase even further with some optimization.
The above may be unclear or confusing in parts. Please ask me questions if it is so. However, kindly note that this is an important decision that has to be made, and which cannot be reversed (that easily) later. Accordingly, the more the number of questions that we consider now, the better.