2012-03-20

Linux distribution chosen!

I had promised to post an update towards the end of January. I did not. However, even the casual reader may have noticed my recent posts related to Debian Wheezy. Those must have served as hints. So, Debian Wheezy it is! I have finally settled on Debian Wheezy with KDE.

Perhaps apt's mechanisms suit my thinking. yum is very powerful, yet the rpm family could not win me over. In fact, at one point, I went close to going back to Arch Linux. However, it is often too cutting-edge — even for a development system.

At the same time, GUI played a non-trivial role in my decision. It also explains why Ubuntu – with its Unity desktop – did not survive in my computer. I felt GNOME to be too restrictive. Some people think that KDE has too many knobs and switches; that it is daunting. Again, perhaps its mechanisms suit my thinking.

With the decision made, I have removed the ISO files of well over a dozen distributions and the VMware images of about half-a-dozen. Peace!

2012-03-17

Debian Wheezy : updating Java alternatives

Debian Wheezy (or even Sid) defaults to Java 6. Originally, my computer had openjdk-6-jdk. I wanted to utilise the newer features in Java 7 such as higher performance and lower memory footprint, and try the Fork-Join framework. Accordingly, I installed openjdk-7-jdk. It updated the Debian alternatives for Java to point to the newer version. So far, so good!

Dependencies can upset the apple cart

Then, I installed Eclipse using apt-get. The version of Eclipse installed is 3.7.1, which is fine. However, it pulls in Java 6 as a dependency. I somehow did not pay attention to that. As the installation completed, I noticed several messages informing me that the alternatives for Java were being reset to Java 6. I bit my lip hard! I think that apt-get should explicitly warn the user if an installation downgrades a package, or more, due to dependencies.

Simple remedy

Fortunately, a simple remedy is possible. But before we begin, we should check the priorities with which both versions are installed. To check the same, issue the following command in a terminal.

> update-alternatives --display javac


javac - auto mode
  link currently points to /usr/lib/jvm/java-6-openjdk-amd64/bin/javac
/usr/lib/jvm/java-6-openjdk-amd64/bin/javac - priority 1061
  slave javac.1.gz: /usr/lib/jvm/java-6-openjdk-amd64/man/man1/javac.1.gz
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac - priority 100
Current 'best' version is '/usr/lib/jvm/java-6-openjdk-amd64/bin/javac'.

Please note the numbers at the end of the full paths of javac. So, both Java 6 and Java 7 are installed, but Java 6 has a higher priority — 1061 to 100. It is, therefore, considered the best version. We can check where /etc/alternatives/javac points, too, for confirmation.

The remedy to apply itself utilises update-alternatives. In order to take care of all the important JDK components in one shot, I collected the commands into a shell script.

> cat up-java-alt.sh

#!/usr/bin/env sh
#
# Update Debian alternatives for Java.

update-alternatives --install \
        /usr/bin/java java \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/java 1100

update-alternatives --install \
        /usr/bin/appletviewer appletviewer \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/appletviewer 1100

update-alternatives --install \
        /usr/bin/apt apt \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/apt 1100

update-alternatives --install \
        /usr/bin/extcheck extcheck \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/extcheck 1100

update-alternatives --install \
        /usr/bin/idlj idlj \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/idlj 1100

update-alternatives --install \
        /usr/bin/jar jar \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jar 1100

update-alternatives --install \
        /usr/bin/jarsigner jarsigner \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jarsigner 1100

update-alternatives --install \
        /usr/bin/javac javac \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/javac 1100

update-alternatives --install \
        /usr/bin/javadoc javadoc \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/javadoc 1100

update-alternatives --install \
        /usr/bin/javah javah \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/javah 1100

update-alternatives --install \
        /usr/bin/javap javap \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/javap 1100

update-alternatives --install \
        /usr/bin/jconsole jconsole \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jconsole 1100

update-alternatives --install \
        /usr/bin/jdb jdb \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jdb 1100

update-alternatives --install \
        /usr/bin/jhat jhat \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jhat 1100

update-alternatives --install \
        /usr/bin/jinfo jinfo \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jinfo 1100

update-alternatives --install \
        /usr/bin/jmap jmap \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jmap 1100

update-alternatives --install \
        /usr/bin/jps jps \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jps 1100

update-alternatives --install \
        /usr/bin/jrunscript jrunscript \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jrunscript 1100

update-alternatives --install \
        /usr/bin/jsadebugd jsadebugd \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jsadebugd 1100

update-alternatives --install \
        /usr/bin/jstack jstack \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jstack 1100

update-alternatives --install \
        /usr/bin/jstat jstat \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jstat 1100

update-alternatives --install \
        /usr/bin/jstatd jstatd \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/jstatd 1100

update-alternatives --install \
        /usr/bin/native2ascii native2ascii \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/native2ascii 1100

update-alternatives --install \
        /usr/bin/rmic rmic \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/rmic 1100

update-alternatives --install \
        /usr/bin/schemagen schemagen \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/schemagen 1100

update-alternatives --install \
        /usr/bin/serialver serialver \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/serialver 1100

update-alternatives --install \
        /usr/bin/wsgen wsgen \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/wsgen 1100

update-alternatives --install \
        /usr/bin/wsimport wsimport \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/wsimport 1100

update-alternatives --install \
        /usr/bin/xjc xjc \
    /usr/lib/jvm/java-7-openjdk-amd64/bin/xjc 1100

Please note that we used a priority value of 1100, so that we can assign Java 7 a higher priority than that of Java 6. Now, we run the above script, and check again the alternatives status, and where /etc/alternatives/javac points.

> update-alternatives --display javac


javac - auto mode
  link currently points to /usr/lib/jvm/java-7-openjdk-amd64/bin/javac
/usr/lib/jvm/java-6-openjdk-amd64/bin/javac - priority 1061
  slave javac.1.gz: /usr/lib/jvm/java-6-openjdk-amd64/man/man1/javac.1.gz
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac - priority 1100
Current 'best' version is '/usr/lib/jvm/java-7-openjdk-amd64/bin/javac'.

Enjoy Java 7 again! Don't forget to change the default JVM path in Eclipse, though.

What about the man pages installed as slaves? That part is left as an exercise :-)

2012-03-08

Convergent synthesis, finally!

The legacy C++ version of my chemistry product has finally gained the capability to do convergent synthesis — in a primitive form for now.

What is convergent synthesis?

Consider a complex molecule that we wish to synthesise. In the usual method, the synthesis steps proceed linearly. Suppose that the following is the sequence of synthesis steps, where Goal designates the molecule to synthesise.

A → B → C → D → E → F → G → H → I → J → Goal

The above sequence has 10 steps in the process. We start with a simple, available molecule A. Presumably, a functional group is either added, replaced or deleted at each step. [Reality includes several more dimensions, but let us keep the discussion simple.] While easy to comprehend, the biggest problem with this is the effective yield of the route. For the purposes of discussion, let us assume an average yield of 85% per step. With 10 steps, the effective yield is less than 20%!

Contrast this with the following scheme.

P → Q → R ⌉
          | → Goal
S → T → U ⌋

In this case, we have two independent paths leading to moderately complex molecules R and U. Then, these two paths converge to give rise to a more complex molecule, which in this case is our goal molecule. The expectation is that since R and U are only moderately complex, they can be independently synthesised in a couple of steps each. The effective yield for the convergent route, then, is about 44%! This is, obviously, much more attractive.

How does it work in retro-synthesis?

My product actually does retro-synthesis, i.e., it starts with the goal molecule, and constructs the steps in reverse order. At each step, the product molecule is broken into reactants. In most scenarios, the coreactant is a trivial molecule; or, it is directly available for purchase from a company like Sigma Aldrich.

If we wish to take advantage of convergent synthesis, on the other hand, how we break a product molecule into possible sets of reactants becomes a matter of extreme significance. In the step R + U → Goal, effective convergence is possible if and only if R and U have about the same degree of complexity.

What next?

The immediate challenge is to locate such reactions, and build a repertoire of them. Another, of course, is to be able to resolve a product molecule to utilise one such! Our chances depend on being able to identify a reasonably central atom that is suitable for initiating the cleavage of the product molecule. The algorithms have to be refined to exploit this new capability, as well!

2012-03-07

Difficult decision - 2

Firstly, I am surprised at the amount of traffic my previous post has generated. I had over a thousand visitors within a few hours: redirected from Google, Reddit, Hacker News, etc.! I wonder how may of them would have read the post had it been the opposite way, i.e., a declaration that I was adopting Go for the next version of my chemistry product? It appears to demonstrate an intriguing aspect of human nature!

Plug-in

Several people have suggested various ways in which out-of-process plug-ins are better. Some of these suggestions arose probably because of me not describing a plug-in adequately. I tried to remedy the situation in individual responses. I am collecting some of those points hereunder.

I am looking at plug-ins for at least the following benefits. In all cases, the said plug-in could potentially be supplied by me, a third-party developer or the user herself.

  • Substitute an algorithm for another.
  • Substitute an algorithm implementation for another.
  • Add a new algorithm not originally shipped with the product.
  • Add a calculator or a transformer as a hook in a particular processing step.

Performance

My application has to process millions (sometimes tens of millions) of molecules per run. A large number of them are processed by the proposed plug-ins. The number reduces with each advancing stage of processing, owing to elimination in each stage. The load is, hence, lower towards the tail of the work flow. But, upstream plug-ins are invoked for most molecules.

An out-of-process plug-in will require the following steps for communication:

  • serialisation of input in the main program,
  • deserialisation of input in the plug-in,
  • serialisation of output in the plug-in, and
  • deserialisation of output in the main program.

The above steps are in addition to the unavoidable protocol handshake for each request. Evidently, the larger the amount of data that needs to be exchanged between the main program and the plug-in, the slower the above process will be. Let us look at the data structure that will get exchanged the highest in my application, viz., Molecule. It has the following information, at a minimum:

  • a unique identifier,
  • a list of atoms, where each atom has:
    • a unique identifier,
    • element type,
    • spatial coordinates,
    • net charge,
    • stereo configuration,
    • number of implicit H atoms attached to it,
    • aromaticity,
    • list of rings it is a part of,
    • whether it is a bridgehead,
  • a list of bonds, where each bond has:
    • a unique identifier,
    • the atoms it joins,
    • order: single, double, triple, aromatic, …,
    • stereo configuration,
    • aromaticity,
    • list of rings it is a part of,
  • a list of rings with their own properties,
  • a list of components,
  • a list of functional groups,
  • … .

It may be possible to have a protocol to allow a plug-in to declare the subset of the above data that it actually needs. However, checking the protocol and selectively serialising the data has a cost itself.

On the contrary, an in-memory plug-in accesses the object using a pointer, with just a transfer of ownership but no transfer of data.

Manageability

External plug-ins also raise the subject of their life cycle management and resolution. The questions that need to be addressed include the following.

  • When is a plug-in activated? Together with the main program? On demand?
  • Should plug-ins die with the main program? How do we handle the main program terminating abnormally?
  • How long should a plug-in continue to run, if idle?
  • How should zombie plug-ins be handled?
  • If socket-based, how should port numbers be managed?
  • How should multiple plug-ins providing the same capability be resolved? How about versions?

While in-memory plug-ins do not automatically solve all the above, they do eliminate some of them easily.

Interesting Options Suggested

An interesting option that was suggested was to package the Go binary distribution as part of my product. Then, when a plug-in is downloaded, the main program itself could be re-compiled and re-linked to include the plug-in. This is a possibility. Some infrastructure code has to be written, though.

Another family relates to embedding a scripting language. This is another possibility. However, it is neither fair nor acceptable to force third-party plug-ins to have to always suffer (the relatively) lower performance because of the scripting language itself. This may become very important if the plug-in is intended for use in the initial stages of the work flow.

The Need for a Java API

Independent of the above, there is no straight-forward way to provide a Java API on the top of an application written in Go. This issue remains unaddressed.

2012-03-01

Difficult decision

Considering the number of hits this post continues to have on a weekly basis, I should point out that a follow-up post can be found at A Difficult Decision - 2.

I like Go as a language. It is a rare, good balance between simplicity, power and expressiveness. I was so impressed with it as to begin the development of the next version of my chemistry product in it. There were minor hiccups, but I made progress, and wrote as many as about 12,000 lines of code (includes comments).

Then, I paused.

Now, about two months later, I have decided to shift the development of the new version of my product to … Java. If that sounds anti-climactic, so it is!

There are two significant reasons for this decision.

Plug-ins

The intended users of my product (and its API) are organic chemists and cheminformatics programmers. In practice, several organic chemists neither program themselves nor have programming assistants.

My product has several basic capabilities built into it. These are exposed as callable or configurable algorithms. What I also want to do is enable the user to extend the capabilities of the product through plug-ins.

In the Java land, this is well-studied problem, with more than one established way of solving. In the Go scenario, on the other hand, this is not straight-forward. Go programs are linked statically. There are requisites to enable plug-ins.

What would I need to do?

  1. Supply the statically-linked full product.
  2. Supply *.a files of my product's packages, and *.o files of the driver part.

What would my customer need to do?

  1. Install gcc and family and their dependencies. [Once; update: not needed after Go1. Thanks Andrew Gerrand!]
  2. Install (and maintain) the Go development environment. [Once, at least after Go1.]
  3. Download the plug-in package file .a, and place it in a designated directory.
  4. Build and install the application.[1]

Why will this not work?

It could, with programmers. It, unfortunately, will not, with chemists. For many chemists, the exertion of supplying input to a computer program and reading its output using a suitable viewer, is a considerable condescension. Should I ask them to compile code, and that too for each plug-in, typical chemists will laugh me out of court!

The API Consumption Issue

Most foot soldier programmers at pharmaceutical companies, biotechnology companies and informatics services providers have graduated in the last ten years, or so. Most of them learned Java as their first language, and have used it the most. Naturally, they tend to gravitate towards solutions with a published Java API.

Chemistry product companies acknowledge this. Almost all of them provide a Java API for their products and libraries.[2] Those that have remained with a non-Java API are certainly not the leaders today!

Therefore, for me to appeal to those programmers, I must provide a fully capable Java API. My prospects of commercial success depend on that factor in no minor way.

Conclusion

Accordingly, I have shifted the product to Java. My team and I have begun re-designing and re-implementing the legacy product — this time to suit a more Java-like style; to appeal to Java audience!



[1] goinstall does not help, since it needs source code to be available. Plug-in authors may not be willing to publish source code.

[2] If they cannot provide a proper Java API, they at least provide a JNI wrapper.