2011-05-22

Legacy code and multi-threading

After a hiatus of a few months, here is an update. What follows is an e-mail that I sent to my collaborators regarding the future technological direction of our organic chemistry product.
* * * *
Apart from resuming my work on importing available chemicals databases, etc., I have spent some time over the last few weeks studying recent developments in making programs multi-threaded, etc. I have written several multi-threaded programs in the past. Usually, the following are the important problems that arise in the context of multi-threaded programs:
  1. identifying the variables (mutable state of data) that need to be accessed from more than one thread,
  2. determining the locking mechanisms needed to protect simultaneous access to such variables from multiple threads, and
  3. using appropriate synchronization mechanisms for proper control flow and cause-effect relationships.
In contrast to the model described above, an alternative approach was proposed by Hewitt, Bishop and Steiger in 1973. It is called the Actor model. It comprises independent objects (all objects are actors) communicating via messages. No other form of interaction exists between independent objects. This communication can be synchronous or asynchronous, but is usually asynchronous by default. Also, actors can create new actors. Actors can also maintain state, designating what their behavior should be in response to future incoming messages.
Yet another alternative was proposed in 1978 by Hoare, called Communicating Sequential Processes (CSP). While being similar in spirit to the Actor model (all objects are communicating processes), CSP differs in a few aspects. For instance, in the Actor model, the sender of a message needs to know the address of the receiver in order to send a message. In CSP, communication happens through channels. A sender need not be aware of the identity of the receiver at the other end of the channel. Also, in CSP, message-passing is in itself a synchronization mechanism: a sender cannot complete the process of sending a message unless the receiver is ready to accept one. In the Actor model, on the other hand, the `system' itself holds the messages that receivers are not yet ready to accept.
Please refer to http://en.wikipedia.org/wiki/Actor_model and http://en.wikipedia.org/wiki/Communicating_sequential_processes for introductory details.
It is pertinent to note that neither the Actor model nor CSP mandates conventional locks or globally-shared data! It is known in the industry that locking and lock-based synchronization are difficult to implement correctly, and even more difficult to scale up to multi-processor scenarios.
About 2-3 years ago, when I used Scala (http://www.scala-lang.org) for a few programs, I utilized the Actor model (provided in Scala through a standard library) to develop a framework for lock-less multi-threading that could utilize all the available processors. That was, of course, before I had heard of Akka. I employed it successfully for processing large amounts of data from the chemical databases of one of my clients.
Some time last year, I became aware of, and interested in, the Go (http://golang.org) language. These days, I use it for most new programs that I develop. Go implements CSP at the language level itself. I have found it to be very useful and convenient for developing multi-threaded programs that operate in a lock-less manner. I have recently written a utility program, using Go, to view and search for molecules in MDL's SDF file format.
Due to C and C++ not having an inherent notion of threading or message passing, there exists no straight-forward way of making ST's main program multi-threaded. We have to employ the Boost library or the MPI(Message-Passing Interface) library or (worse) POSIX threads directly or something similar, in order to wrap certain parts of the current main program. I remember that you looked at MPI (OpenMPI, in particular) several months ago
In my (rather limited) experience, both Boost libraries and MPI present steep learning gradients. They are large, their APIs are large, they are inherently complex, and the very amount of code they require us to write is significantly high. I am, in fact, more concerned about the effort needed to debug and maintain such code in the long run. In contrast, I have found Scala's Actor model and Go's CSP model to be welcome reliefs by way of their simplicity and ease of use.
Scala is a language implemented for the Java Virtual Machine (JVM). This, unfortunately, means that should we choose to employ Scala, our clients have to install Java in their Linux computers. Installation of Java is a decision outside our control, and is made by their respective IT departments.
Go, on the other hand, is similar to C and C++: it compiles to native code, and does not require a virtual machine. Consequently, we can give a new main program to our clients without asking them to install a JVM first. More importantly, we can easily call C code from within Go, making it feasible to re-use select parts of the current main program in a more flexible manner.
In view of the above, after several weeks of study and trials, I think that we should employ Go for making ST's main program multi-threaded. This applies to all areas of the main program that lend themselves to multi-threading, in general, and to the convergent synthesis problem, in particular.
I have written a small part of the new SDF import program in Go. Apart from a particular annoyance that I encountered (because of the absence of generics in Go), I should say that I was delighted to see how easily the program could be written multi-threaded to utilize both the processors in my computer. The effective increase in speed was about 50%, and I think that it can increase even further with some optimization.
The above may be unclear or confusing in parts. Please ask me questions if it is so. However, kindly note that this is an important decision that has to be made, and which cannot be reversed (that easily) later. Accordingly, the more the number of questions that we consider now, the better.

No comments: