tag:blogger.com,1999:blog-48922093590427152562024-02-21T07:13:24.157+05:30One of Many WorldsA personal perspective in a web of intertwined viewsAnonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.comBlogger48125tag:blogger.com,1999:blog-4892209359042715256.post-29322513100939478092015-03-02T17:23:00.000+05:302015-03-02T17:23:51.358+05:30Go, GitHub and Travis : a small lesson in dependencies<div>
<p>
Recently, I ran into a small, but interesting, issue in a new
open-source project that I am working
on: <a href="https://github.com/RxnWeaver/RxnWeaver">RxnWeaver</a>.
It is being developed in Go. Inspired by a few other projects
that I follow, I set up Travis CI for this project.
</p>
<p>
I do the development
in <a href="https://github.com/js-ojus/RxnWeaver">my fork of the
project</a>, before raising periodic (or need-based) pull requests
to the main repository.
</p>
<p>
In a particular commit, I happened to add a few exported constants
to a package (let us call this <b>Package A</b>) in the
repository. In the next commit, I added code that depended on
some of those constants, to a different package (let us call
this <b>Package B</b>) in the repository. As usual, I pushed the
commits to my GitHub fork after making sure that the code was
formatted with <code>go fmt</code> and that the tree builds
without errors. I raised a pull request to the main repository,
and the fun began!
</p>
<p>
Travis CI reported failure saying that the build did not complete
successfully. A little investigation revealed the cause. In
Package B, references to Package A use the
official <code>github.com/RxnWeaver/RxnWeaver</code> import paths.
The official version of the package there, however, has the old
set of constants that does not include the new and required ones.
Therefore, Package B could not be built.
</p>
<p>
The commit that was intended to update Package A, though, was part
of the same pull request!
</p>
<p>
Of course, it was easy to fix, but the simple lesson is: if you do
not want to have to use the command line and some <code>git</code>
trickery, do not forget to raise a pull request (and have it
merged) for each piece that could be used as a dependency in a
different Go package!
</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-1623207514367815812014-03-04T12:36:00.000+05:302014-03-04T16:06:16.188+05:30On Effects<div>
<h2>Background</h2>
<p>Today, I happened to watch a panel discussion from Lang.NEXT 2012. It featured Anders Hejlsberg, Martin Odersky, Gilad Bracha and a certain Peter Alvaro, and was anchored by the inimitable Erik Meijer. At the end of what turned out to be a very interesting discussion, a Q&A session was hosted. The last few questions related to effects-checking at the language level. Odersky responded by saying that effects-checking in static type systems is currently at about the same primitive level as where static typing itself was in Pascal: clunky and cumbersome!</p>
<p>Watching that video finally prompted me to collect my thoughts into a written note.</p>
<h2>The Beginning</h2>
<p>The first time I pondered the question of effects was around 1994-5, when I started working on C++ programs with sizes in the range of a hundred thousand to a million lines of code. A prominent stimulus was the <code>const</code> annotation on methods. The C++ compiler was capable of tracing changes to the current object made by a given method. In particular, I took note of the transitive nature of <code>const</code>. This transitive nature was simultaneously useful and painful. The <code>const</code>-ness problem in C++ is both well known and well documented.</p>
<p>Curiously, a method annotated <code>const</code> could still call external code, perform I/O, etc., as long as it did not mutate the current object. Often, I wanted a <code>const</code> method to not get too <em>intelligent</em> and perform unspecified operations. This was particularly true of third-party code without accompanying source. But, C++ provided no mechanism to declare and enforce any such.</p>
<h2>Unmet Need</h2>
<p>I went on to write programs for IBM 360/370 family of mainframe operating systems, a few flavours of UNIX, Windows NT, <i>etc</i>. Over the years, numerous times, I felt the same need for better guarantees on methods (and functions/procedures). I found it interesting that none of the languages that I worked with, in any of those environments, provided a solution to this problem.</p>
<p>Every once in a while, I would think of effects. Those were mostly unstructured thoughts, though. In addition, I was a typical <em>applications</em> programmer, with no formal background in computer science and programming languages. Having moved from theoretical Physics into programming, I often tried drawing analogies and parallels. Some of them were useful – sometimes and to some extent – but would always break down eventually.</p>
<h2>Explicit Effects</h2>
<p>By 1998-9, I had begun developing a better appreciation for dynamically-typed languages. Not weakly-typed languages such as Perl and Tcl, but strongly-typed ones such as Smalltalk, Python and (later) Ruby. I had accidentally come across the Smalltalk <em>Blue Book</em> by Goldberg and Robson. It opened my eyes to several new windows and doors! I employed Python and Ruby in a variety of projects, with great results for my clients. In the process, for a few years, I did not explore statically-typed languages. Nonetheless, the issue of effects surfaced time and again, particularly as the sizes of code bases and teams increased.</p>
<p>I returned to a large (> a million LoC) C++ project in 2002, and that work stirred my thoughts on effects yet again. Based on my experiences, I began collecting a <em>wish list</em> of the kinds of effects that I wanted the compiler to track. Towards that, I began comparing my thoughts to the facilities provided by some of the languages that I used or became aware of.</p>
<h3>Java</h3>
<p>Java's checked exceptions force a method to either handle a thrown exception or re-throw the same, and declare so in the method signature. While no other effects can be declared, a consumer knows from the signature that such a method may either return a value normally or throw one of the specified exceptions.</p>
<h3>Haskell</h3>
<p>I came across Haskell in 2003. I found the basics easy enough to follow, and wrote small exercise programs to gain some familiarity with it. Those days, there were not many easy tutorials for beginners, so I frequently had to dig into the scant documentation. As I read about Haskell, I found three interesting aspects standing out <a id="fn1-ref" href="#fn1">[1]</a>:</p>
<ul>
<li>all Haskell functions are technically unary,</li>
<li>its system of type classes, and</li>
<li>its effects system.</li>
</ul>
<p>The latter, of course, is relevant to the current discussion. Haskell does not require us to say anything specific about a pure function. On the other hand, when a function is not pure, Haskell requires us to utilise an appropriate type (usually a monad) to indicate the specific manner in which the function causes side effects. This allows for user-defined monads to specialise the kind of effects caused by a function. Once defined and used, these monads are utilised by Haskell's powerful type system to ensure consistency of use across the program.</p>
<p>Much later, I happened to watch the video of a talk by Erik Meijer, in which he remarked <em>there are many ways for a function to be impure, but there is only one way to be pure</em>. And, the dots connected!</p>
<h3>D</h3>
<p>D follows the approach of C++, but takes it further. Unlike C++'s <code>const</code>, a <code>pure</code> function in D must be free of side effects. This is a much stronger guarantee, and helps significantly. However, notice the difference between Haskell's philosophy and D's: functions (and methods) in D are assumed impure by default. And, thus, pure functions (and methods) have to be explicitly marked <code>pure</code>.</p>
<h3>Nimrod</h3>
<p>An interesting variation can be found in Nimrod. It provides some pragmas to specify effects<a id="fn2-ref" href="#fn2">[2]</a>. In particular, we can specify the possible exceptions thrown by a <code>proc</code> or a method. If it does not declare any exceptions, it is assumed to throw the base exception type. To avoid that, it has to expressly declare an empty list of exceptions.</p>
<p>There are plans to implement read and write tracking in Nimrod. In addition, an interesting feature is the capability to <em>tag</em> a <code>proc</code> or a method with some types. The meaning of those types is ascribed by the user; Nimrod doesn't appear to care! However, once specified, these tagged types are tracked by the compiler analogously to how exceptions are tracked. Thus, it provides an expressive mechanism to introduce user-defined effect types as long as they behave similarly to exceptions.</p>
<h2>Evolution of My Thoughts</h2>
<p>The numerous projects that I worked on shaped the development of my own thoughts on effects. Apart from working on huge assembler, COBOL, PL/I and Rexx code bases on IBM mainframes, I worked on large projects that used C, C++, Java, Python, Ruby, <i>etc.</i> in a wide variety of application domains. Particular combinations of application domains and languages sometimes led to specific realisations.</p>
<h3>Tracking Effects</h3>
<p>I believe that effects tracking can be effectively implemented in both statically-typed languages and dynamically-typed ones. Type systems for effects appear to be orthogonal to those for values. Accordingly, the following discussion does not distinguish between the static vs. dynamic nature of types for values. Similarly, it does not distinguish between object-oriented and non-object-oriented languages. It does, on the other hand, assume that there <em>is</em> an ahead-of-time or just-in-time compilation phase — <i>i.e.</i> parsing the source should not result in an AST that is directly executed immediately.</p>
<h3>Compiler-Defined Effects</h3>
<p>An analysis of the program by the compiler is necessary for any effects system to be useful. The signature of each function or method in the program has to be verified against inferred effects of that function or method. All deviations have to be marked as errors, and the compiler should refuse to compile such. Effects should be annotated as a possible combination of:</p>
<dl>
<dt><b>mutates</b></dt>
<dd>mutates object state,</dd>
<dt><b>mutates_params</b></dt>
<dd>mutates one or more input parameters passed by reference,</dd>
<dt><b>reads</b></dt>
<dd>reads input from the world: heap, message queues, files, network, <i>etc.</i>,</dd>
<dt><b>writes</b></dt>
<dd>writes output to the world: heap, message queues, files, network, <i>etc.</i>,</dd>
<dt><b>tainted</b></dt>
<dd>invokes <em>untrusted</em> external code,</dd>
<dt><b>recursive</b></dt>
<dd>may not return due to self recursion,</dd>
<dt><b>i_recursive</b></dt>
<dd>may not return due to mutual or more indirect recursion, and</dd>
<dt><b>throws</b></dt>
<dd>may throw one or more exceptions.</dd>
</dl>
<p>A function or method with none of the above effects is considered <b>pure</b>: it is a mathematical function!</p>
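<p>The taxonomy above suggests a natural internal representation: effects as combinable flags, with purity being the absence of all of them. The following is a toy sketch in Go — entirely my own illustration, not any real compiler's machinery; the names and the simple union rule in <code>Propagate</code> are assumptions:</p>

```go
package main

import "fmt"

// Effect is a bit mask of the compiler-defined effects listed above.
type Effect uint

const (
	Mutates Effect = 1 << iota
	MutatesParams
	Reads
	Writes
	Tainted
	Recursive
	IRecursive
	Throws
)

// IsPure answers whether a function with effects `e` is pure: it has no
// effects at all.
func IsPure(e Effect) bool { return e == 0 }

// Propagate answers a caller's effects: its own, combined with those of
// everything it calls.  (Handled `throws` would be masked out before this
// union; that refinement is omitted here.)
func Propagate(own Effect, callees ...Effect) Effect {
	for _, c := range callees {
		own |= c
	}
	return own
}

func main() {
	// A function that only computes is pure; one that calls a logger is not.
	compute := Effect(0)
	log := Writes
	caller := Propagate(compute, log)
	fmt.Println(IsPure(compute), IsPure(caller)) // true false
}
```

<p>The transitive nature of most effects falls out of the union in <code>Propagate</code>: once a callee writes, every caller up the chain is marked as writing, too.</p>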
<h3>User-Defined Effects</h3>
<p>User-defined effects are second-class citizens. They are tracked like compiler-defined ones, but the compiler itself cannot relate to the meaning of such effects.</p>
<h3>Propagation of Effects</h3>
<p><b>throws</b> and user-defined effects are <em><b>mutable</b></em>. It should be possible to <em>handle</em> them, and stop their propagation. Whether the resolution of such handled exceptions is nominal, involves subtyping, <i>etc.,</i> is dependent on the type system of the language. Except for those, all other effects are fundamentally transitive in nature. Each effect propagates up the call hierarchy from where it occurs.</p>
<p>This mandates run-time compilation in the case of languages supporting fully separate compilation of modules/packages/…. Cross-boundary passing of lambdas and methods to higher-order functions and methods necessitates dynamic compiler checks at run time. Violations of effects guarantees should lead to a designated run-time exception that cannot be handled.</p>
<p>This can yet be avoided should it be possible to perform a whole-program analysis upon dynamic linking. But maybe that opens a different Pandora's box!</p>
<hr />
<p id="fn1" class="notes">[1] At that time, I could not comprehend the machinery behind those. Not that I fully comprehend it even now; my current understanding is only marginally better! <a href="#fn1-ref">↩</a></p>
<p id="fn2" class="notes">[2] <a href="http://nimrod-lang.org/manual.html#effect-system">http://nimrod-lang.org/manual.html#effect-system</a> <a href="#fn2-ref">↩</a></p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-55097031987790017022014-01-08T12:38:00.000+05:302014-06-12T19:46:56.687+05:30Another go at Go ... failed!<div>
<p>After a considerable gap, I gave Go another go!</p>
<h2>The Problem</h2>
<p>As part of a consulting engagement, I accepted a project to develop some statistical inference models in the area of drug (medicine) repositioning. Input data comprises three sets of associations: (i) between drugs and adverse effects, (ii) between drugs and diseases, and (iii) between drugs and targets (proteins). Using drugs as hub elements, associations are inferred between the other three kinds of elements, pair-wise.</p>
<p>The actual statistics computed vary from simple measures such as sensitivity (<i>e.g.</i> how sensitive is a given drug to a set of query targets?) and clustering coefficients of the multi-mode graph, to construction of rather complex confusion matrices, computation of measures such as Matthews Correlation Coefficient, to construction of generalised profile vectors for drugs, diseases, <i>etc.</i> Accordingly, the computational intensity varies considerably across parts of the models.</p>
<p>For the size of the test subset of input data, the in-memory graph of direct and transitive associations currently has about 15,000 vertices and over 14,000,000 edges. This is expected to grow by two orders of magnitude (or more) when the full data set is used for input.</p>
<h2>Programming Language</h2>
<p>I had some temptation initially to prototype the first model (or two) in a language like Ruby. Giving the volume of data its due weight though, I decided to use Ruby for <i>ad hoc</i> validation of parts of the computations, with coding proper happening in a faster, compiled language. I have been using Java for most of my work (both open source as well as for clients). However, considering the fact that statistics instances are write-only, I hoped that Go could help me make the computations parallel easily<a name="fn1-ref" href="#fn1">[1]</a>.</p>
<p>My choice of Go caused some discomfort on the part of my client's programmers, since they have to maintain the code down the road. Nevertheless, no serious objections were raised. So, I went ahead and developed the first three models in Go.</p>
<h2>Practical Issues With Go</h2>
<p>The Internet is abuzz with success stories involving Go; there isn't an additional perspective that I can add! The following are factors, in no particular order, that <i>inhibited</i> my productivity as I worked on the project.</p>
<h3>No <code>Set</code> in the Language</h3>
<p>Through (almost) every hour of this project, I found myself needing an efficient implementation of a set data structure. Go does not have a built-in set; it has arrays, slices and maps (hash tables). And, Go lacks generics. Consequently, whichever generic data structure is not provided by the compiler cannot be implemented in a library. I ended up using maps as sets. Everyone who does that realises the pain involved, sooner rather than later. Maps provide uniqueness of keys, but I needed sets for their set-like properties: being able to do difference, union, intersection, <i>etc.</i> I had to code those in-line every time. I have seen several people argue vehemently (even arrogantly) in <code>golang-nuts</code> that it costs just a few lines each time, and that it makes the code clearer. Nothing could be further from the truth. In-lining those operations only reduced readability and obscured my intent. I had to consciously train my eyes to recognise those blocks to mean union, intersection, <i>etc.</i> They were also very inconvenient when trying different sequences of computations for better efficiency, since a quick glance never sufficed!</p>
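<p>To make the pain concrete, here is roughly what those in-lined operations look like once extracted into helpers. This is a sketch only: the element type <code>int</code> and the names <code>intersect</code>/<code>minus</code> are mine, not from the project.</p>

```go
package main

import "fmt"

// set is the usual maps-as-sets idiom: keys carry the membership, and the
// zero-size struct{} value carries nothing.
type set = map[int]struct{}

// intersect answers a new set holding the elements common to `a` and `b`.
func intersect(a, b set) set {
	r := make(set)
	for k := range a {
		if _, ok := b[k]; ok {
			r[k] = struct{}{}
		}
	}
	return r
}

// minus answers a new set holding the elements of `a` not present in `b`.
// Note how close its shape is to intersect's: only the test is inverted,
// which is exactly why the in-lined forms are hard to tell apart at a glance.
func minus(a, b set) set {
	r := make(set)
	for k := range a {
		if _, ok := b[k]; !ok {
			r[k] = struct{}{}
		}
	}
	return r
}

func main() {
	a := set{1: {}, 2: {}, 3: {}}
	b := set{2: {}, 3: {}, 4: {}}
	fmt.Println(len(intersect(a, b)), len(minus(a, b))) // 2 1
}
```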
<p>Also, I found the performance of Go maps wanting. Profiling showed that get operations were consuming a good percentage of the total running time. Of course, several of those get operations are actually to check for the presence of a key.</p>
<h3>No <code>BitSet</code> in the Standard Library</h3>
<p>Since the performance of maps was dragging the computations back, I investigated the possibility of changing the algorithms to work with bit sets. However, there is no <code>BitSet</code> or <code>BitArray</code> in Go's standard library. I found two packages in the community: one on <code>code.google.com</code> and the other on <code>github.com</code>. I selected the former, both because it performed better and because it provided a convenient iteration through only the bits set to true. Mind you, the data is mostly sparse, and hence both of these were desirable characteristics.</p>
<p>Incidentally, both the bit set packages have varying performance. I could not determine the sources of those variations, since I could not easily construct test data to reproduce them on a small scale. A well-tested, high performance bit set in the standard library would have helped greatly.</p>
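<p>For reference, the core of such a bit set is quite small. The sketch below is my own minimal version — not either of the community packages mentioned above — showing set/test operations and an iteration that skips all-zero words entirely, which is what makes bit sets attractive for sparse data:</p>

```go
package main

import (
	"fmt"
	"math/bits"
)

// BitSet is a minimal growable bit set backed by 64-bit words.
type BitSet struct{ words []uint64 }

// Set turns bit `i` on, growing the backing storage as needed.
func (b *BitSet) Set(i uint) {
	w := i >> 6
	for uint(len(b.words)) <= w {
		b.words = append(b.words, 0)
	}
	b.words[w] |= 1 << (i & 63)
}

// Test answers whether bit `i` is on.
func (b *BitSet) Test(i uint) bool {
	w := i >> 6
	return w < uint(len(b.words)) && b.words[w]&(1<<(i&63)) != 0
}

// Each calls `f` with the index of every bit that is on, in ascending
// order, skipping all-zero words entirely - the property that matters
// when the data is mostly sparse.
func (b *BitSet) Each(f func(uint)) {
	for wi, w := range b.words {
		for w != 0 {
			f(uint(wi)<<6 | uint(bits.TrailingZeros64(w)))
			w &= w - 1 // clear the lowest set bit
		}
	}
}

func main() {
	var b BitSet
	b.Set(3)
	b.Set(64)
	b.Set(1000)
	b.Each(func(i uint) { fmt.Println(i) }) // 3, 64, 1000
}
```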
<h3>Generics, or Their Absence</h3>
<p>The general attitude in the Go community towards generics seems to have degenerated into one consisting of a mix of disgust and condescension, unfortunately. Well-made cases that illustrate problems best served by generics are being dismissed with such impudence and temerity as to cause repulsion. That Russ Cox's original formulation of the now-famous tri-lemma is incomplete at best has not sunk in despite four years of discussions. Enough said!</p>
<p>In my particular case, I have six sets of computations that differ in:</p>
<ul>
<li>types of input data elements held in the containers, and upon which the computations are performed (a unique combination of three types for each pair, to be precise),</li>
<li>user-specified values for various algorithmic parameters for a given combination of element types,</li>
<li>minor computational steps and</li>
<li>types (and instances) of containers into which the results aggregate.</li>
</ul>
<p>These differences meant that I could not write common <em>template</em> code that could be used to generate six versions using extra-language tools (as inconvenient as that already is). The amount of boiler-plate needed externally to handle the differences very quickly became both too much and too confusing. Eventually, I resorted to six fully-specialised versions <b><i>each</i></b> of data holders, algorithms and results containers, just for manageability of the code.</p>
<p>This had an undesirable side effect, though: now, each change to any of the core containers or computations had to be manually propagated to all the corresponding remaining versions. It soon led to a disinclination on my part to quickly iterate through alternative model formulations, since the overhead of trying new formulations was non-trivial.</p>
<h3>Poor Performance</h3>
<p>This was simply unexpected! With fully-specialised versions of graph nodes, edges, computations and results containers, I was expecting very good performance. Initially, it was not very good. In single-threaded mode, a complete run of three models on the test set of data took about 9 minutes 25 seconds. I re-examined various computations. I eliminated redundant checks in some paths, combined two passes into one at the expense of more memory, pre-identified query sets so that the full sets need not be iterated over, <i>etc.</i> At the end of all that, in single-threaded mode, a complete run of three models on the test set of data took about 2 minutes 40 seconds. For a while, I thought that I had squeezed it to the maximum extent. And so thought my client, too! More on that later.</p>
<h3>Enhancement Requests</h3>
<p>At that point, my client requested three enhancements, two of which affected all the six + six versions of the models. I ploughed through the first change and propagated it through the other eleven specialised versions. I had a full taste of what was to come, though, when I was hit with the realisation that I was yet working on Phase 1 of the project, which had seven proposed phases in all!</p>
<h2>Back to Java!</h2>
<p>I took a break of one full day, and did a hard review of the code (and my situation, of course). I quickly identified three major areas where generics and (inheritance-based) polymorphism would have presented a much more pleasant solution. I had already spent 11 weeks on the project, the bulk of that going into developing and evaluating the statistical models. With the models now ready, I estimated that a re-write in Java would cost me about 10 working days. I decided to take the plunge.</p>
<p>The full re-write in Java took 8 working days. The ease with which I could model the generic data containers and results containers was quite expected. Java's <code>BitSet</code> class was of tremendous help. I had some trepidation about the algorithmic parts. However, they turned out to be easier than I anticipated! I made the computations themselves parts of formally-typed abstract classes, with the concrete parts such as substitution of actual types, the user-specified parameters and minor variations implemented by the subclasses. Conceptually, it was clear and clean: the base computations were easy to follow in the abstract classes. The overrides were clearly marked so, and were quite pointed.</p>
<p>Naturally, I expected a reduction in the size of the code base; I was not sure by how much, though. The actual reduction was by about 40%. This was nice, since it came with the benefit of more manageable code.</p>
<p>The most unexpected outcome concerned performance: a complete run of the three models on the test set of data now took about 30 seconds! My first suspicion was that something went so wrong as to cause a premature (but legal) exit somewhere. However, the output matched what was produced by the Go version (thanks Ruby), so that could not have been true. I re-ran the program several times, since it sounded too good to be true. Each time, the run completed in about 30 seconds.</p>
<p>I was left scratching my head. My puzzlement continued for a while, before I noticed something: the CPU utilisation reported by <code>/usr/bin/time</code> was around 370-380%! I was now totally stumped. <code>conky</code> showed that all processor cores were indeed being used. How <i>could</i> that be? The program was very much single-threaded.</p>
<p>After some thought and Googling, I saw a few factors that potentially enabled a utilisation of multiple cores.</p>
<ul>
<li>All the input data classes were <code>final</code>.</li>
<li>All the results classes were <code>final</code>, with <b>all</b> of their members being <code>final</code> too.</li>
<li>All algorithm subclasses were <code>final</code>.</li>
<li>All data containers (masters), the multi-mode graph itself, and all results containers had only insert and look-up operations performed on them. None had a delete operation.</li>
</ul>
<p>Effectively, almost all of the code involved only <code>final</code> classes. And, all operations were append-only. The compiler <i>may</i> have noticed those; the run-time <i>must</i> have noticed those. I still do not know what is going on inside the JRE as the program runs, but I am truly amazed by its capabilities! Needless to say, I am quite happy with the outcome, too!</p>
<p><b><i>Update:</i> As several responses (both here and on Hacker News) stated, Java's multi-threaded GC appears to be primary reason for the utilisation of all the processor cores.</b></p>
<h2>Conclusions</h2>
<ul>
<li>If your problem domain involves patterns that benefit from type parameterisation or<a name="fn2-ref" href="#fn2">[2]</a> polymorphism that is easily achievable through inheritance, Go is a poor choice.</li>
<li>If you find your Go code evolving into having few interfaces but many higher-order functions (or methods) that resort to frequent type assertions, Go is a poor choice.</li>
<li>Go runtime can learn a trick or two from JRE 7 as regards performance.</li>
</ul>
<p>These may seem obvious to more informed people; but to me, it was some enlightenment!</p>
<hr />
<p id="fn1" class="notes">[1] I tried Haskell and Elixir as candidates, but nested data holders with multiply circular references appear to be problematic to deal with in functional languages. Immutable data presents interesting challenges when it comes to cyclic graphs! The solutions suggested by the respective communities involved considerable boiler-plate. More importantly, the resulting code lost direct correspondence with the problem's structural elements. Eventually, I abandoned that approach.<a href="#fn1-ref">↩</a></p>
<p id="fn2" class="notes">[2] Not an exclusive or.<a href="#fn2-ref">↩</a></p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-4885410116071309972013-10-16T15:00:00.003+05:302013-10-19T16:36:41.609+05:30Captions on Indian trucks - an unexpected lesson<div class="numstart">
<p class="sri">
Several years ago, when I lived in Bengaluru, my company's offices used to be in Electronics City (for a few years). Owing to the distance and the disgusting volume of traffic, I used to leave for the office rather early, around 07:00.
</p>
<p class="sri">
In those early hours, long-distance trucks were permitted to travel through the city. As I overtook them, I used to read the captions written on those trucks. A particular caption was very, very common on trucks coming from the North: <b>``burI nazar vAlE, tErA muH kAlA"</b> (roughly <i>``oh you who cast an evil eye on me, your face shall become blackened"</i>).
</p>
<p class="sri">
After a while, I grew so familiar with it, that I usually read only the first word before turning my attention to the next truck.
</p>
<p class="sri">
On a particular day, I had this truck right ahead of me, when we stopped at a traffic signal. The caption began with the usual <b>``burI"</b>, and I almost turned in another direction, but something pulled my attention back. The caption read: <b>``burI nazar vAlE, tErA bhI bhalA hO"</b> (roughly <i>``oh you who cast an evil eye on me, I wish you well in spite of that"</i>).
</p>
<p class="sri">
I was stunned! It took me a while to digest that. Am I equal to that spirit? I don't think so. Nonetheless, it has had a rather mysterious effect on my thinking!
</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-69150298292121678912013-08-14T14:55:00.001+05:302013-10-16T22:31:12.001+05:30Ring Detection - 1<div class="numstart">
<p>In this two-part series, we look at how <a href="https://github.com/js-ojus/oct"><b>Ojus Chemistry Toolkit (<i>OCT</i>)</b></a> currently implements ring detection. In this part, detection of rings and ring systems is described.</p>
<h2 class="num">Preparation</h2>
<p><b>N.B.</b> OCT's molecule reader converts any hydrogen atoms that it encounters into implicit hydrogens, <i>i.e.</i>, the list of atoms and bonds in an OCT molecule never contains a hydrogen atom. Similarly, the number of hydrogens attached to an atom is determined by looking at the atom's valence and any charge specified for it.</p>
<p><b>N.B.</b> The Frerejacque value of the molecule is checked to see if there are <i>any</i> cycles at all, before the following procedure is employed.</p>
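<p>For a connected molecule, the Frerejacque value is simply bonds − atoms + 1, the cyclomatic number of the molecular graph; a value of zero means the molecule is acyclic and the rest of the procedure can be skipped. A trivial sketch (my own helper for illustration, not OCT's actual API):</p>

```go
package main

import "fmt"

// frerejacque answers the number of independent rings in a connected
// molecular graph: bonds - atoms + 1.  Zero means the molecule is acyclic.
// (For a disconnected graph, one would add the number of components
// instead of 1.)
func frerejacque(numAtoms, numBonds int) int {
	return numBonds - numAtoms + 1
}

func main() {
	// Benzene: 6 carbon atoms, 6 ring bonds (hydrogens are implicit).
	fmt.Println(frerejacque(6, 6)) // 1
	// n-Hexane: 6 atoms in a chain, 5 bonds - acyclic.
	fmt.Println(frerejacque(6, 5)) // 0
}
```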
<p>When we are interested in finding cycles (and only cycles), we should not look at open branches — those that contain only leaves. Accordingly, we remove them. In reality, all of this computation happens on temporary data structures, not the actual molecule itself.</p>
<h3 class="num">Removing Terminal Chains</h3>
<p>Remember terminal atoms from <a href="http://oneofmanyworlds.blogspot.in/2011/05/graphs-and-molecules-1.html">this old article</a>? To recap: a terminal atom is one that has a single neighbour. We remove all such atoms. However, removing a terminal atom could make its sole neighbour a terminal atom itself! We have to remove it, too. But, then, removing it … <i>ad infinitum</i>. Here is the pseudocode.</p>
<pre>
var removing := true.
while (removing) {
removing := false.
for each atom `a` {
if `a` is a terminal {
remove `a`.
removing := true.
}
}
}
</pre>
<p>Note that removing an atom entails removing all the bonds in which it participates, <i>etc</i>. Thus, by the time the outer loop exits, all terminal <b>chains</b> will have been removed.</p>
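<p>The pseudocode above translates almost mechanically into Go. The sketch below operates on a plain adjacency map — a simplification of OCT's actual temporary structures — and removes atoms with fewer than two neighbours until none remain:</p>

```go
package main

import "fmt"

// pruneTerminalChains repeatedly removes atoms having fewer than two
// neighbours, until none remain.  `adj` maps an atom ID to the set of its
// neighbours' IDs; it is modified in place.  Removing an atom also removes
// its bonds, which is what makes its neighbour terminal in turn.
func pruneTerminalChains(adj map[int]map[int]bool) {
	removing := true
	for removing {
		removing = false
		for a, nbrs := range adj {
			if len(nbrs) <= 1 {
				for n := range nbrs {
					delete(adj[n], a) // remove the bond from the other side
				}
				delete(adj, a)
				removing = true
			}
		}
	}
}

func main() {
	// A triangle (1-2-3) with a pendant chain 3-4-5.
	adj := map[int]map[int]bool{
		1: {2: true, 3: true},
		2: {1: true, 3: true},
		3: {1: true, 2: true, 4: true},
		4: {3: true, 5: true},
		5: {4: true},
	}
	pruneTerminalChains(adj)
	fmt.Println(len(adj)) // 3 : only the ring atoms remain
}
```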
<h2 class="num">How Many Rings?</h2>
<h3 class="num">One Ring or More?</h3>
<p>At this point, if all the atoms have exactly two neighbours, there is only one ring in the molecule. Why? More than one ring implies that a second ring is either fused to the first one, or is independent of it, but connected through a linking chain of atoms. In either case, there should be at least one atom that is a junction — it should have at least three bonds.</p>
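<p>That observation yields a cheap test on the pruned graph, sketched here on the same simplified adjacency-map representation (and assuming, as the argument above does, a connected molecule):</p>

```go
package main

import "fmt"

// hasSingleRing answers whether the pruned, connected molecular graph is
// exactly one ring: every remaining atom must have exactly two neighbours.
// `adj` is the adjacency map left after removing terminal chains.
func hasSingleRing(adj map[int]map[int]bool) bool {
	for _, nbrs := range adj {
		if len(nbrs) != 2 {
			return false
		}
	}
	return len(adj) > 0
}

func main() {
	triangle := map[int]map[int]bool{
		1: {2: true, 3: true},
		2: {1: true, 3: true},
		3: {1: true, 2: true},
	}
	fmt.Println(hasSingleRing(triangle)) // true
}
```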
<h2 class="num">Cycle Detection</h2>
<h3 class="num">Case of One Ring</h3>
<p>In the case of the molecule having only one ring, we mark that ring as the sole ring. We also create a single ring system with that sole ring. And, we are done!</p>
<h3 class="num">Case of Multiple Rings</h3>
<p>Detection of two or more rings present in the molecule is (rather) evidently more complex than that of a single ring. Over the last decade or so, I have written four or five distinct algorithms to detect multiple rings in a molecule. <b><i>All</i></b> of them employed a recursive depth-first approach. The recursive approach made certain parts of the algorithm easier and clearer, while complicating certain others.</p>
<p>This time, though, I chose to employ a breadth-first approach. The outline is as follows.</p>
<pre>
var candidates := new list of paths.
var path := new list of atoms.
var a := the first non-junction atom in the molecule.
if no such atom exists {
a := the first atom in the molecule.
}
path := `path` with `a` appended.
candidates := `candidates` with `path` appended.
while (`candidates` is not empty) {
var test-path := fetch and remove the first path in `candidates`.
process `test-path`.
}
</pre>
<p>As you can see, I have a queue into which I inject all candidate paths. A candidate path is potentially incomplete, and may need to be extended by adding a neighbour of the last atom in it, should one such exist. But that is part of processing the path.</p>
<h4 class="num">Process a Test Path</h4>
<pre>
var current := last atom in `test-path`.
for each neighbour `nbr` of `current` {
if `nbr` was visited already in `test-path` {
var ring := path from previous occurrence of `nbr` to `current`.
if `ring` is valid {
record `ring`.
}
continue loop.
}
var new-path := `test-path` with `nbr` appended.
candidates := `candidates` with `new-path` appended.
}
</pre>
<p>Thus, we add each path extension to the queue of candidates, at each atom that we encounter. Now, given a candidate ring, how do we validate it? First, the easy case: if the candidate has only three atoms, then it is certainly a genuine ring. Good! But, what if it has more than three?</p>
<h4 class="num">Preliminary Validation</h4>
<pre>
if size of `ring` is 3 {
answer `true`.
}
for each atom `a` in `ring` {
if more than 2 neighbours of `a` exist in `ring` {
<label class="comment">/* Must be a pair of rings fused by a single bond. */</label>
answer `false`.
}
}
<label class="comment">/* Must be a simple closed curve. */</label>
answer `true`.
</pre>
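<p>Taken together, the fragments above amount to a breadth-first search over simple paths. Here is a minimal, self-contained Python sketch of the whole procedure, purely for illustration: the function names and the adjacency-map representation are my own, and recorded rings are de-duplicated by their atom sets, a simplification that suffices here.</p>

```python
from collections import deque

def find_rings(adjacency):
    """Breadth-first ring perception, following the pseudocode above.

    `adjacency` maps each atom to the *set* of its neighbours."""
    atoms = list(adjacency)
    # Prefer a non-junction atom (degree <= 2) as the starting point.
    start = next((a for a in atoms if len(adjacency[a]) <= 2), atoms[0])

    rings = set()                  # recorded rings, de-duplicated by atom set
    candidates = deque([[start]])  # queue of candidate (simple) paths
    while candidates:
        path = candidates.popleft()
        current = path[-1]
        for nbr in adjacency[current]:
            if nbr in path:
                ring = path[path.index(nbr):]  # close the cycle at `nbr`
                if is_valid_ring(ring, adjacency):
                    rings.add(frozenset(ring))
                continue
            candidates.append(path + [nbr])    # extend the path by `nbr`
    return rings

def is_valid_ring(ring, adjacency):
    """Preliminary validation, as in the pseudocode above."""
    if len(ring) < 3:   # stepping straight back along a bond is not a ring
        return False
    if len(ring) == 3:
        return True     # three atoms always form a genuine ring
    members = set(ring)
    # An atom with more than two neighbours inside the candidate betrays
    # a pair of rings fused by a single bond.
    return all(len(adjacency[a] & members) <= 2 for a in ring)
```

<p>Since all extended paths are simple (a visited neighbour closes a ring instead of extending the path), the queue is guaranteed to drain on a finite molecule.</p>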
<h2 class="num">Ring System Detection</h2>
<p>Once we have recorded all rings, we sort them by size. The next task is to partition this set of rings into equivalence classes of ring systems. Two given rings belong to the same ring system if they have at least one bond or at least one atom in common.</p>
<h3 class="num">Partitioning</h3>
<p>The actual task of partitioning the set of rings is quite straightforward.</p>
<pre>
var ring-systems := new list.
outer:
for each recorded ring `r` {
for each ring system `rs` {
if `r` and `rs` share a bond or
`r` and `rs` share an atom {
mark `r` as belonging to `rs`.
rs := `rs` with `r` appended.
continue `outer`.
}
}
var new-rs := new ring system.
new-rs := `new-rs` with `r` appended.
ring-systems := `ring-systems` with `new-rs` appended.
}
</pre>
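<p>The partitioning loop translates almost directly into Python. Note that two rings sharing a bond necessarily share that bond's two atoms, so a shared-atom test alone suffices; and, as in the pseudocode, a ring is attached to the first matching system. The names here are illustrative, not the project's own.</p>

```python
def partition_into_ring_systems(rings):
    """Partition rings (sequences of atoms) into ring systems: rings
    sharing at least one atom belong to the same system."""
    ring_systems = []  # each entry: (set of member atoms, list of rings)
    for ring in rings:
        members = set(ring)
        for atoms, system in ring_systems:
            if atoms & members:   # `ring` touches this ring system
                atoms |= members
                system.append(ring)
                break             # like `continue outer` in the pseudocode
        else:
            # No existing system matched: start a new one with this ring.
            ring_systems.append((members, [ring]))
    return [system for _, system in ring_systems]
```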
<h2>Conclusion</h2>
<p>We have now identified the rings in the molecule, and have partitioned the set into ring systems. However, this set may contain spurious rings. Eliminating the false candidates is, surprisingly, a more difficult procedure. We shall look into it in the next post!</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-54767227349949957622013-04-30T12:22:00.000+05:302013-04-30T16:30:02.039+05:30C++11's strongly-typed enums<div>
<p>
C++11 provides strongly-typed <code>enum</code>s. These new <code>enum</code>s are no longer simple sets of integers; they are classes in their own right. And, their instances are objects, too!
</p>
<p>
In the following example, we define several enumerated values for possible stereo configurations of a chemical bond.
</p>
<pre>
enum class BondStereo : int8_t
{
NONE
, UP <label class="comment">//!< Bond points up from the plane.</label>
, DOWN <label class="comment">//!< Bond points into the plane.</label>
, UP_OR_DOWN <label class="comment">//!< Bond may point up from or into the plane.</label>
, E <label class="comment">//!< Bond may be single or double, and has entgegen configuration.</label>
, Z <label class="comment">//!< Bond may be single or double, and has zusammen configuration.</label>
, CIS <label class="comment">//!< Double bond having cis configuration.</label>
, TRANS <label class="comment">//!< Double bond having trans configuration.</label>
, UNSPECIFIED <label class="comment">//!< Bond has unknown stereo geometry.</label>
};
</pre>
<p>
Note that the above <code>enum</code> class <code>BondStereo</code> has <code>int8_t</code> as its underlying data type. C++11 allows you to specify the precise integral data type that should be used. In turn, that allows us to choose specific types to best suit the required ranges of the enumerated values.
</p>
<p>
However, with this facility comes an inconvenience. Since the instances of these <code>enum</code>s are no longer simple integers, they cannot be used as such! The following is a simple and generic way of obtaining the underlying integral value of a given instance.
</p>
<pre>
template <typename E>
auto enumValue(const E e) -> typename std::underlying_type<E>::type
{
return static_cast<typename std::underlying_type<E>::type>(e);
}
</pre>
<p>
Note the use of <code>auto</code> and the trailing return type declaration. This is needed, since we do not know in advance the exact underlying type of a given instance.
</p>
<p>
Great, you say, but which headers should I include to be able to use this functionality? Well, that is left as an exercise :-) Enjoy!
</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-37572807402387043342013-04-23T11:57:00.002+05:302013-04-23T11:57:56.361+05:30Operational Disaster<div>
<p>
My return to blogging did not turn out to be a real return, evidently! Here is an update after several months, again!
</p>
<h2>Operational Disaster</h2>
<p>
An unprecedented, peculiar and curious sequence of quick events took place in the middle of March. I was investigating the possible use of a VirtualBox image as a distribution mechanism for my organic synthesis planning product. My scientist created a CentOS 6.4 image with the program and its dependencies, and shared it. I made a copy of it, and imported it into VirtualBox on my Windows 8 computer. It worked well, as intended. Thus far, the experiment was successful!
</p>
<p>
To ease testing the program in the VirtualBox environment, I mounted my primary data virtual disk in the CentOS image. The testing itself was successful. It appeared as if this mechanism was proving to be technologically sound and convenient as well.
</p>
<h3>Disaster Strikes!</h3>
<p>
Some more testing later, I decided to re-test the entire procedure. I removed the CentOS image from VirtualBox. However, it complained of an existing image when I tried to import the appliance the next time. Since the old image was no longer needed, I asked VirtualBox to delete it from the disk. It did so!
</p>
<p>
I ran through the procedure again, with the same initial success. I was quite satisfied with this approach. Then, exactly as before, I tried mounting my primary data virtual disk in this new CentOS image. Navigating to the drive where it was located, I was astonished to <b>not</b> find it there! Concerned, I opened Windows explorer, and navigated to the said drive. Indeed, the <code>.vmdk</code> file was gone!
</p>
<h3>What Had Happened?</h3>
<p>
As of the time that I deleted the CentOS image, the mounted external virtual disk had an entry in its <code>/etc/fstab</code>. So, when VirtualBox deleted the image, it deleted the mounted disk's file as well! VirtualBox claims that mounted disks that are used by other images are <b><i>not</i></b> deleted. How did this deletion happen, then?
</p>
<p>
The answer is straight-forward: the mounted disk was not used by any other <i>VirtualBox</i> image! It was the home partition of another Linux image, but that was a VMware image! So, VirtualBox was indeed true to its promise. But, my data was lost!
</p>
<h3>More Disaster!</h3>
<p>
I was distressed at having lost my home partition in this unforeseen manner. The lost virtual disk was my primary data drive. The host Windows home folder did not have anything of consequence, except in the <code>Downloads</code> folder. Fortunately, I had a backup from January. It was stored in a separate NTFS drive in the laptop, as well as an external USB disk.
</p>
<p>
So, losing no time, I proceeded to restore the contents from that backup. It was an encrypted backup. Since I always name my top-level folders with a common prefix, I asked for the files to be restored to my Windows home folder. Towards the very end of the unpacking exercise, something went wrong, and the system froze! I gave it several minutes to recover, but to no avail. There was no disk activity; no sign of life; even the <i>Caps Lock</i> key stopped responding after a while. I waited patiently ... with my fingers crossed, for over half-an-hour. The computer had really frozen!
</p>
<p>
Disappointed, I switched the power off. With considerable trepidation, I switched it on a few seconds later. True enough, the system froze while booting! I felt really let down. I tried restarting the computer a few times, with the same inevitable result — the login screen wouldn't appear. <b>My entire Windows had become unusable!</b> Two strokes in a row.
</p>
<h3>Rescue Attempt</h3>
<p>
I used my wife's computer to read about rescuing Windows. There was a Windows rescue partition (drive) in the hard disk. I tried to restore Windows using that rescue partition, following the instructions that I found in Microsoft's Web site. Alas! My earlier impulse upgrade to Windows 8 haunted me unpleasantly there. The rescue image was that of Windows 7, which came pre-installed with the HP hardware. When I attempted a rescue, it kept <em>informing</em> me that I had a more recent version of Windows installed, and that it couldn't rescue it!
</p>
<p>
For an hour, or so, I was quite dumbstruck. But then, life has to go on, of course! I contemplated possible courses of action. And finally chose what I knew best.
</p>
<h2>Go The Linux Way!</h2>
<p>
I burned <b>Kubuntu 12.04 LTS</b> to a DVD, and booted into the Live image. After making sure that all essential hardware worked as expected, I installed it to the hard disk as the only operating system. During the installation, I opted for the proprietary firmware for the Broadcom 4313GN wireless card, <i>etc.</i>, to be downloaded and installed. Everything went smoothly. The only irritating aspect was the download of over a dozen English language packs! Readers will remember that I was irritated by this in my old Debian (virtual) installation too.
</p>
<p>
Upon re-booting, I found – as expected – a few hundred updates. Accordingly, I allowed the system to update. Next, from the Kubuntu PPA, I upgraded KDE to 4.10.1. After verifying that the system was working properly, I restored the backup from the external USB disk. It <b>Just Worked</b>!
</p>
<h3>Laptop Power Management</h3>
<p>
Initially, the laptop battery powered the system for only about 2 hours. Windows (both 7 and 8) used to last between 4 and 5 hours on a single charge. I read several articles and blog posts on ways to improve battery life in Linux. None of them improved the situation measurably! <code>powertop</code> showed a baseline discharge rate of over 23W when idling. That was disappointing!
</p>
<p>
My HP laptop has an integrated Intel 4000 HD graphics card and an nVidia GEFORCE 630M discrete graphics card. I realised that I had to try Bumblebee. Following the instructions in one of the Ubuntu fora, I installed Bumblebee. I installed the <b>primus</b> configuration rather than the <b>optimus</b> one.
</p>
<p>
Having read a few bad things about Bumblebee, I felt a trepidation similar to what I had felt when I re-booted the frozen Windows system. Fortunately, though, Kubuntu booted normally, and to my great relief, Bumblebee worked. <code>powertop</code> showed a new baseline consumption rate of a little over 10W when idling! Now I get the same 4-5 hours of battery life on a single charge!
</p>
<h3>What Do I Not Like?</h3>
<p>
The power management is a little too eager. It puts the wireless interface to sleep every few minutes. For it to wake up takes several seconds upon next use. I have to keep staring at the screen until then, sometimes a little impatiently. These days, though, I use <code>powertop</code> to put the Broadcom 4313GN wireless card into <b><i>bad</i></b> state so that it is not put to sleep so aggressively.
</p>
<h3>What Do I Miss?</h3>
<p>
All of this is fine, of course, but do I miss anything? What I miss most is <b>Google Drive</b> native application. I usually do not sign on-line petitions, but I did sign the one at <a href="http://drive4linux.com">Drive4Linux</a>. I was disappointed to find only about 1,500 signatures. Nevertheless, I signed it, and requested my G+ contacts to follow suit if they use Linux and Google Drive.
</p>
<p>
Other than the above, I have not felt any notable inconvenience or loss of functionality, so far! Thus, after a break of about five years, I have returned to running Linux natively in my primary computer!!
</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-67520713828624715832012-12-27T12:17:00.001+05:302012-12-28T03:35:00.577+05:30Return to blogging!<div>
<p>After several months of silence, here I return to blogging! A few quick updates are in order, in no particular order.</p>
<h2>Trivia</h2>
<ul>
<li>When the battery of my MacBook Pro began failing in May, I purchased a relatively low-end HP Pavilion dv6 7040TX pre-installed with Windows 7. I mostly like it. It generates <b><i>very</i></b> little heat. By contrast, the MacBook Pro is a mini heater for the winter. Another noticeable feature is battery life: I am getting about five hours of development time per charge. The only downside is the low screen resolution, which is 1366x768. In practice, though, it has proved to be adequate for my development needs.</li>
<li>It was a rare occasion when I surprised myself: I impulsively upgraded the HP computer to Windows 8! My unfamiliarity with Windows was amply proved by the numerous device and driver difficulties I encountered upon upgrading. Reading a related Microsoft Knowledge Base article revealed that there was an important step that I had missed. [<i>For the curious: we are supposed to uninstall and re-install the devices when upgrading in-place.</i>]</li>
<li>VMware Player now offers OpenGL-based 3D support for Linux guests. Upon upgrading to the new version of Player, I realised promptly that Debian Wheezy had a problem that prevented it from recognising and utilising 3D devices. It appears as if Sid has this problem as well, since my experimental Aptosid image failed to turn on desktop effects.</li>
<li>Thus, I now run Linux Mint 14 KDE. [<i>Of course, it is KDE!</i>] It has been quite stable for my daily development needs (several Emacs windows, Eclipse 3.7 and several Konsole windows and tabs). This is in stark contrast to the frustrating experience with the Cinnamon version, which I downloaded first, mistaking it to be the KDE version. This demonstrates — yet again — why choice is so important, and why it underlies the philosophy of free and open source software!</li>
</ul>
<h2>Largely distracted months</h2>
<p>I went through several months of non-work distractions. I am glad that those are nearing their respective conclusions. Not being able to concentrate on work can be really frustrating. More so if one's <i>To Do</i> list is long.</p>
<h2>Experiments with languages</h2>
<p>During these largely unproductive months, I studied a few languages, peripherally. Here is a summary.</p>
<h3>Haskell</h3>
<p>I had briefly looked at Haskell, in 2000. It looked so different that I promptly left it. Having gained a little more functional thinking in the meantime, I decided to take another look at it. A good motivation was Hughes' influential paper "Why Functional Programming Matters". Some Haskell features are straight-forward: type classes, pattern matching, guards in functions, list comprehensions, <i>etc.</i> Some others are deep: higher-order functions, currying, lazy evaluation, <i>etc.</i> A few others go even deeper: functors, applicatives, monads, <i>etc.</i> Haskell demands considerable time and effort — not only for the language proper, but for the tool chain too. The syntax is layout-sensitive, and I could not even find a decent major mode for Emacs. The package tool called <code>cabal</code> repeatedly left my Haskell environment in an unstable state. Tooling is certainly a problem for the beginner, but otherwise, Haskell is a profound language that makes your thinking better!</p>
<h3>Dart</h3>
<p>Dart is a curious mix of the semantics of Smalltalk, the syntax of Java and JavaScript, and memory-isolated threads that communicate by sending messages. Added into this curious mix is a compile-time type system that does <b>not</b> affect the run-time object types! Mind you, Dart is strongly-typed. Even though there is a compile-time type system, it is optional and is primarily intended for better tooling, and the language itself is dynamically-typed. The types are carried by the objects, but the variables themselves are untyped. Dart's biggest promise is the ability to write scalable Web applications using a single language on both the server side and the client. The server side seems to present no problems, but the Web programming community is divided in its opinion on Dart's client side promise. The contention arises because Dart has its own virtual machine. Using the VM requires the user to install it as a plug-in in her browser. For those who do not want to use the VM, Dart comes with a cross-compiler that outputs equivalent JavaScript code.</p>
<h3>D</h3>
<p>I had known of the existence of D for several years, even though I never looked at it in detail. Reading a little about the history of D, I realised that it underwent a rather tumultuous adolescence. With D2, it appears to have entered adulthood. The motivation to look at D was an MSDN Channel 9 video of a presentation by Andrei Alexandrescu. D was designed to be a better C++. Several of the design decisions behind it can be better appreciated if we hold that premise in mind. It has a simplified, more uniform syntax, garbage collection, a decent standard library and easier generics. It maintains the ability to directly call a C function in an external library, immediately making a vast set of C libraries accessible. Scoped errors and compile-time function evaluation are examples of D's interesting features. Another notable feature is the partitioning of the language features into <b>safe</b>, <b>trusted</b> and <b>unsafe</b> subsets, with the ability to verify that a module is completely safe, <i>etc.</i> D's performance is comparable to that of C++.</p>
<h3>Others</h3>
<p>I also looked briefly at Erlang and Clojure. However, I did not spend enough time on them to be able to form an opinion.</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-6695486261150483432012-04-23T05:44:00.000+05:302012-04-23T05:44:29.968+05:30The changing face of urban Hyderabad<div>
<p>A few days ago, my family went shopping to the Ameerpet area of Hyderabad. We shopped for about an hour-and-a-half; the time was 17:00. My wife wanted to have some coffee (so did I, in fact). We could not find a place that served coffee in the immediate vicinity. We walked in the general direction of a few restaurants. Thus began an amazing hour of discovery!</p>
<p>We went into the first restaurant that we came across. We seated ourselves at the first table available. Presently, a waiter turned up. "Two strong, hot coffees," I said. "No, sir," he replied promptly, "we don't serve coffee." I was surprised. We picked up the bags, and walked on.</p>
<p>At the next restaurant, we were cautious. We did not go as far as seating ourselves; rather, we waited for a waiter to approach us. "Do you serve coffee?" I enquired. We got the same reply: "No." I was more surprised. We walked on.</p>
<p>The third restaurant was a familiar one. It has been around for over twenty-five years. When I had last visited it, it served coffee, tea and snacks. However, that was several years ago. My four-year-old son complained of hunger by this time. He wanted a <em>pesarattu</em> (a special Telugu dish: a kind of large, thin pancake). I felt that there was a high probability that this restaurant would serve both <em>pesarattu</em> and coffee. So, we climbed up a floor to the restaurant. "No, sir. We used to serve South Indian food until about six months ago. We no longer do. Now, we serve Mughalai, Tandoori and Chinese!" I was mildly astonished. My wife and I sighed simultaneously, and we walked on.</p>
<p>My son was very disappointed. As we walked, he was eagerly watching for another restaurant. This time, we had to walk quite some distance before we came across another. Its look made it clear that it was a very non-vegetarian-oriented restaurant. We did not bother to walk in. My wife and I had a quick consultation, and decided to turn around, pass the shopping area, and try in the other direction.</p>
<p>My son's disappointment grew with each passing twenty-five metres, or so. He started getting petulant. We negotiated the distance back to the shopping area with some difficulty, coaxing my son along the way. As we walked past it, we soon realised that there were no restaurants within sight! <b><i>By this time, we had spent close to an hour covering a little over a kilometre, without finding a place that served South Indian snacks and coffee!</i></b> We resigned ourselves, got into the car, and drove back home.</p>
<p>The episode left me wondering, however, about the dramatic transformation that Hyderabad has undergone in the last couple of decades. It is <b>very</b> difficult these days to find decent (or even semi-decent) restaurants that serve Telugu vegetarian food. I have noticed the same trend in Bengaluru too, particularly for supper. A large number of restaurants have colluded to systematically eliminate South Indian menus. A key reason is that Mughalai, Tandoori, Chinese, <i>etc.</i> food is much more expensive. The restaurants earn significantly more per table-hour when they serve them. The constant in-flow of North Indians into Hyderabad has only made it easier for the restaurants to switch over.</p>
<p>Another dimension that has seeped in over the years is that of western fast food (pizzas, burgers, <i>etc.</i>). In the name of maintaining <i>international quality at an international price</i>, the western chains charge ridiculously high prices (by Indian standards) for such fast food. We have to remember, however, that economic liberalisation has placed sudden money and means in the hands of an entire new crop of employees and entrepreneurs (and their pizzas-and-potato-chips brats). India has, consequently, been witnessing rapid changes in urban social patterns. The new-found affluence has resulted in a large number of families dining out several times a week. And, in the name of novelty, a vast majority of them patronise the more expensive varieties. The smaller restaurants, obviously, do not wish to let the opportunity slip by. We see, thus, a steady decline in the number of restaurants serving native food.</p>
<p><b><i>Craving for the new often dislodges the old! In this instance, Telugu (South Indian, in general) food and beverages are the casualty!</i></b></p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-70321312829532015132012-04-07T20:46:00.000+05:302013-10-16T22:30:47.972+05:30Graphs and molecules - 2<script type="text/javascript"
src="https://d3eoax9i5htok0.cloudfront.net/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML">
</script>
<div class="numstart">
<p><b>Note:</b> This post utilises <a href="http://www.mathjax.org">MathJax</a> to display mathematical notation. It may take a second or two to load, interpret and render; please wait!</p>
<p>If you have not read the previous post in this series, please read it first <a href="http://oneofmanyworlds.blogspot.in/2011/05/graphs-and-molecules-1.html">here</a>.</p>
<h2 class="num">Ordering</h2>
<p>The notion of ordering is very intuitive in the context of natural numbers. Indeed, when we learn natural numbers, their representation \(\{1, 2, 3, 4, \ldots\}\) itself imprints an ordering relationship in our minds. Soon enough, we learn to assign a sense of relative magnitude to those numbers: 4 is <em>larger</em> than 2, <i>etc.</i> This concept extends naturally to negative numbers and rational numbers too.</p>
<h3 class="num">A little rigour</h3>
<p>Suppose that we represent the ordering relationship between two elements of a set using the symbol \(\le\). Then, we can define the properties that a set \(S\) should satisfy for it to be ordered.</p>
<ul>
<li><span class="def">Reflexivity:</span> \(a \le a\ \forall a \in S\)</li>
<li><span class="def">Antisymmetry:</span> if \(a \le b\) and \(b \le a\), then \(a = b\ \forall a, b \in S\)</li>
<li><span class="def">Transitivity:</span> if \(a \le b\) and \(b \le c\), then \(a \le c\ \forall a, b, c \in S\)</li>
</ul>
<p>We can readily see that integers and rational numbers satisfy the above properties. Accordingly, we say that integers and rational numbers are <span class="def">ordered</span>, if we assign the meaning <b>smaller than or equal to</b> to the ordering relationship \(\le\).</p>
<p>In fact, we can see that integers and rational numbers also satisfy an additional property.</p>
<ul>
<li><span class="def">Totality:</span> \(a \le b\) or \(b \le a\ \forall a, b \in S\)</li>
</ul>
<h3 class="num">A distinction</h3>
<p>Totality is a stricter requirement than the preceding three. It mandates that an ordering relationship exist between <em>any</em> and <em>every</em> pair of elements of the set. While reflexivity is easy enough to comprehend, the next two specify the conditions that must hold <em>if</em> the elements concerned do obey an ordering relationship.</p>
<p>It is easy to think of sets that satisfy the former three properties, but without satisfying the last. As an example, let us consider the set \(X = \{1, 2, 3\}\). Now, let us construct a set of some of its subsets \(S = \{\{2\}, \{1, 2\}, \{2, 3\}, \{1, 2, 3\}\}\). Let us define the ordering relationship \(\le\) to mean <em>subset of</em> represented by \(\subseteq\). <em>Exercise: verify that the first three properties hold in \(S\).</em></p>
<p>We see that \(\{1, 2\}\) and \(\{2, 3\}\) are elements of \(S\), but <b>neither is a subset of the other</b>.</p>
<p>Therefore, mathematicians distinguish sets satisfying only the first three from those satisfying all the four. The former are said to have <span class="def">partial ordering</span>, and they are sometimes called <span class="def">posets</span> or <em>partially-ordered sets</em>. The latter are said to have <span class="def">total ordering</span>.</p>
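<p>The subset example can be checked mechanically. The following small Python illustration of the definitions (mine, not part of the original argument) verifies the first three properties and shows that totality fails:</p>

```python
from itertools import product

# The poset from the example: some subsets of {1, 2, 3}, ordered by inclusion.
S = [{2}, {1, 2}, {2, 3}, {1, 2, 3}]

def leq(a, b):
    return a <= b  # Python's <= on sets means "is a subset of"

reflexive = all(leq(a, a) for a in S)
antisymmetric = all(a == b
                    for a, b in product(S, S) if leq(a, b) and leq(b, a))
transitive = all(leq(a, c)
                 for a, b, c in product(S, S, S) if leq(a, b) and leq(b, c))
total = all(leq(a, b) or leq(b, a) for a, b in product(S, S))
# The first three properties hold; totality fails, because
# {1, 2} and {2, 3} are incomparable under inclusion.
```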
<h2 class="num">More ordering</h2>
<p>Now, let us expand the discussion to include irrational numbers. Do our definitions apply? There is an immediate difficulty: irrational numbers have non-terminating decimal parts! How do we compare two such numbers? How should we define the ordering relationship? The integral part is trivial; it is the decimal part that presents the difficulty.</p>
<h3 class="num">Sequence comparison</h3>
<p>In order to be able to deal with irrational numbers, we have to introduce an additional notion — sequences. A <span class="def">sequence</span> is an ordered collection (finite or infinite) in which the relative positions of the elements matter. Another distinction from sets is that elements can repeat, occurring at multiple places. The number of elements in a sequence, if it is finite, is called its <span class="def">length</span>. Thus, sequences can be used to represent the decimal parts of irrational numbers.</p>
<p>Let \(X = \{x_1, x_2, x_3, \ldots\}\) and \(Y = \{y_1, y_2, y_3, \ldots\}\) be two sequences. We can define an ordering relationship between sequences as follows. We say \(X \le Y\) if one of the following holds.</p>
<ul>
<li>\(X = Y\).</li>
<li>\(X\) is finite with a length \(n\), and \(x_i = y_i\ \forall i \le n\), and \(y_{n+1}\) exists; that is, \(X\) is a proper prefix of \(Y\).</li>
<li>\(\exists\ n \ge 0\) such that \(x_i = y_i\ \forall i \le n\), both \(x_{n+1}\) and \(y_{n+1}\) exist, and \(x_{n+1} < y_{n+1}\).</li>
</ul>
<p>Armed with the above definition, we can readily see that we can compare two irrational numbers — in fact, any two sequences. <em>Exercise: verify this claim by comparing two irrational numbers and two sequences of non-numerical elements!</em></p>
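<p>Restricted to finite sequences, this comparison is ordinary lexicographic ordering, which is worth seeing in executable form. A small Python sketch (the function name is mine; infinite sequences would require lazy evaluation):</p>

```python
def seq_leq(xs, ys):
    """Lexicographic ordering of finite sequences: decide at the first
    point of difference; otherwise the shorter (or equal) sequence wins."""
    for x, y in zip(xs, ys):
        if x != y:
            return x <= y
    return len(xs) <= len(ys)

# Comparing 3.141... and 3.143... by their digit sequences:
# seq_leq((3, 1, 4, 1), (3, 1, 4, 3)) is True.
```

<p>Python's built-in tuple comparison behaves the same way, which makes the definition easy to experiment with.</p>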
<h3 class="num">Bottom and top elements</h3>
<p>If an ordered set \(S\) has an element \(b\) such that \(b \le s\ \forall s \in S\), the element \(b\) is called the <span class="def">bottom element</span> of the set. The element \(t\) that we get if we replace \(\le\) with \(\ge\) is called the <span class="def">top element</span> of the set. The bottom and top elements, when they exist, are unique in a given set.</p>
<p>In our first example above, \(\{2\}\) is the bottom element of the set, while \(\{1, 2, 3\}\) is the top. However, it is important to understand that bottom and top elements may <b>not</b> exist in a given set of sequences. <em>Exercise: think of one such set.</em></p>
<h3 class="num">Minimal and maximal elements</h3>
<p>When a set does not have a bottom element, it is yet possible for it to have <span class="def">minimal elements</span>. For an element \(m\) to be a minimal element of the set \(S\), \(s \le m \implies s = m\ \forall s \in S\) should hold. If we replace \(\le\) with \(\ge\), we get <span class="def">maximal elements</span>.</p>
<p>Minimal and maximal elements are difficult to establish (and, sometimes, even understand) in the context of infinite sets or complex ordering relationships. The same applies to bottom and top elements, too.</p>
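<p>For finite posets, though, minimal elements can be computed directly from the definition. A brief Python illustration (the names are mine), using a set of subsets that has minimal elements but no bottom element:</p>

```python
def minimal_elements(S, leq):
    """Elements m of S with nothing strictly below them:
    s <= m implies s == m, for every s in S."""
    return [m for m in S if all(s == m for s in S if leq(s, m))]

# Subsets ordered by inclusion: this set has no bottom element,
# yet both {1} and {2} are minimal.
subsets = [{1}, {2}, {1, 2}, {2, 3}]
mins = minimal_elements(subsets, lambda a, b: a <= b)
```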
<h2>Conclusion</h2>
<p>You may have begun wondering if the title of this post was set by mistake. On the contrary, these concepts are very important to understand before we tackle canonical representation of molecules, ring systems in molecules, <i>etc.</i>, which we shall encounter in future posts.</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-54039812828511323352012-03-20T08:26:00.000+05:302012-03-20T08:26:37.305+05:30Linux distribution chosen!<div>
<p>I had promised to post an update towards the end of January. I did not. However, even the casual reader may have noticed my recent posts related to Debian Wheezy. Those must have served as hints. So, Debian Wheezy it is! I have finally settled on <b>Debian Wheezy with KDE</b>.</p>
<p>Perhaps <code>apt</code>'s mechanisms suit my thinking. <code>yum</code> is very powerful, yet the <code>rpm</code> family could not win me over. In fact, at one point, I went close to going back to Arch Linux. However, it is often too cutting-edge — even for a development system.</p>
<p>At the same time, GUI played a non-trivial role in my decision. It also explains why Ubuntu – with its Unity desktop – did not survive in my computer. I felt GNOME to be too restrictive. Some people think that KDE has <em>too</em> many knobs and switches; that it is daunting. Again, perhaps its mechanisms suit my thinking.</p>
<p>With the decision made, I have removed the ISO files of well over a dozen distributions and the VMware images of about half-a-dozen. Peace!</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-20105575559984502242012-03-17T06:23:00.000+05:302012-03-17T06:23:12.129+05:30Debian Wheezy : updating Java alternatives<div>
<p>Debian Wheezy (or even Sid) defaults to Java 6. Originally, my computer had <code>openjdk-6-jdk</code>. I wanted to utilise the improvements in Java 7, such as higher performance and a lower memory footprint, and to try the Fork/Join framework. Accordingly, I installed <code>openjdk-7-jdk</code>. It updated the Debian alternatives for Java to point to the newer version. So far, so good!</p>
<h3>Dependencies can upset the apple cart</h3>
<p>Then, I installed Eclipse using <code>apt-get</code>. The version of Eclipse installed is 3.7.1, which is fine. However, it pulls in Java 6 as a dependency. I somehow did not pay attention to that. As the installation completed, I noticed several messages informing me that the alternatives for Java were being reset to Java 6. I bit my lip hard! I think that <code>apt-get</code> should explicitly warn the user if an installation downgrades one or more packages due to dependencies.</p>
<h3>Simple remedy</h3>
<p>Fortunately, a simple remedy is possible. But before we begin, we should check the priorities with which the two versions are installed. To do so, issue the following command in a terminal.</p>
<pre>
> update-alternatives --display javac
<label class="comment"># You should see something like the following.</label>
javac - auto mode
link currently points to /usr/lib/jvm/java-6-openjdk-amd64/bin/javac
/usr/lib/jvm/java-6-openjdk-amd64/bin/javac - priority 1061
slave javac.1.gz: /usr/lib/jvm/java-6-openjdk-amd64/man/man1/javac.1.gz
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac - priority 100
Current 'best' version is '/usr/lib/jvm/java-6-openjdk-amd64/bin/javac'.
</pre>
<p>Please note the numbers at the end of the full paths of <code>javac</code>. So, both Java 6 and Java 7 are installed, but Java 6 has a higher priority — 1061 to 100. It is, therefore, considered the <b>best</b> version. We can check where <code>/etc/alternatives/javac</code> points, too, for confirmation.</p>
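<p>One way to confirm is to resolve the symlink chain with <code>readlink -f</code>. The sketch below demonstrates this on a scratch link so that it is self-contained; on the real system, you would resolve <code>/etc/alternatives/javac</code> itself.</p>

```shell
#!/usr/bin/env sh
# Create a scratch symlink standing in for /etc/alternatives/javac,
# then resolve it to its final target, as one would on a real system.
ln -sfn /usr/bin/env /tmp/alternatives-demo-javac
readlink -f /tmp/alternatives-demo-javac   # typically prints /usr/bin/env
rm -f /tmp/alternatives-demo-javac
```

<p>On a Debian system, <code>readlink -f /usr/bin/javac</code> follows the whole chain — through <code>/etc/alternatives/javac</code> to the selected JDK — in one step.</p>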
<p>The remedy itself uses <code>update-alternatives</code>. In order to take care of all the important JDK components in one shot, I collected the commands into a shell script.</p>
<pre>
> cat up-java-alt.sh
#!/usr/bin/env sh
#
# Update Debian alternatives for Java.
update-alternatives --install \
/usr/bin/java java \
/usr/lib/jvm/java-7-openjdk-amd64/bin/java 1100
update-alternatives --install \
/usr/bin/appletviewer appletviewer \
/usr/lib/jvm/java-7-openjdk-amd64/bin/appletviewer 1100
update-alternatives --install \
/usr/bin/apt apt \
/usr/lib/jvm/java-7-openjdk-amd64/bin/apt 1100
update-alternatives --install \
/usr/bin/extcheck extcheck \
/usr/lib/jvm/java-7-openjdk-amd64/bin/extcheck 1100
update-alternatives --install \
/usr/bin/idlj idlj \
/usr/lib/jvm/java-7-openjdk-amd64/bin/idlj 1100
update-alternatives --install \
/usr/bin/jar jar \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jar 1100
update-alternatives --install \
/usr/bin/jarsigner jarsigner \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jarsigner 1100
update-alternatives --install \
/usr/bin/javac javac \
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac 1100
update-alternatives --install \
/usr/bin/javadoc javadoc \
/usr/lib/jvm/java-7-openjdk-amd64/bin/javadoc 1100
update-alternatives --install \
/usr/bin/javah javah \
/usr/lib/jvm/java-7-openjdk-amd64/bin/javah 1100
update-alternatives --install \
/usr/bin/javap javap \
/usr/lib/jvm/java-7-openjdk-amd64/bin/javap 1100
update-alternatives --install \
/usr/bin/jconsole jconsole \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jconsole 1100
update-alternatives --install \
/usr/bin/jdb jdb \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jdb 1100
update-alternatives --install \
/usr/bin/jhat jhat \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jhat 1100
update-alternatives --install \
/usr/bin/jinfo jinfo \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jinfo 1100
update-alternatives --install \
/usr/bin/jmap jmap \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jmap 1100
update-alternatives --install \
/usr/bin/jps jps \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jps 1100
update-alternatives --install \
/usr/bin/jrunscript jrunscript \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jrunscript 1100
update-alternatives --install \
/usr/bin/jsadebugd jsadebugd \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jsadebugd 1100
update-alternatives --install \
/usr/bin/jstack jstack \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jstack 1100
update-alternatives --install \
/usr/bin/jstat jstat \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jstat 1100
update-alternatives --install \
/usr/bin/jstatd jstatd \
/usr/lib/jvm/java-7-openjdk-amd64/bin/jstatd 1100
update-alternatives --install \
/usr/bin/native2ascii native2ascii \
/usr/lib/jvm/java-7-openjdk-amd64/bin/native2ascii 1100
update-alternatives --install \
/usr/bin/rmic rmic \
/usr/lib/jvm/java-7-openjdk-amd64/bin/rmic 1100
update-alternatives --install \
/usr/bin/schemagen schemagen \
/usr/lib/jvm/java-7-openjdk-amd64/bin/schemagen 1100
update-alternatives --install \
/usr/bin/serialver serialver \
/usr/lib/jvm/java-7-openjdk-amd64/bin/serialver 1100
update-alternatives --install \
/usr/bin/wsgen wsgen \
/usr/lib/jvm/java-7-openjdk-amd64/bin/wsgen 1100
update-alternatives --install \
/usr/bin/wsimport wsimport \
/usr/lib/jvm/java-7-openjdk-amd64/bin/wsimport 1100
update-alternatives --install \
/usr/bin/xjc xjc \
/usr/lib/jvm/java-7-openjdk-amd64/bin/xjc 1100
</pre>
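<p>The repetitive commands above can equally be generated by a small loop. The sketch below only <i>prints</i> the commands, so it is safe to inspect first; drop the <code>echo</code> (and run as <i>root</i>) to execute them. The tool list is abridged, and should be adjusted to match the contents of the JDK's <code>bin</code> directory.</p>

```shell
#!/usr/bin/env sh
# Print (not run) the update-alternatives registrations for several JDK
# tools. Drop the 'echo' and run as root to actually register them.
JDK=/usr/lib/jvm/java-7-openjdk-amd64
for tool in java javac javadoc javah javap jar jarsigner jconsole jdb; do
    echo update-alternatives --install \
        "/usr/bin/$tool" "$tool" "$JDK/bin/$tool" 1100
done
```

<p>Each printed line matches the corresponding hand-written command in the script above, one per tool.</p>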
<p>Please note that we used a priority value of 1100, so that Java 7 now has a higher priority than Java 6's 1061. Now, we run the above script, and check the alternatives status again, including where <code>/etc/alternatives/javac</code> points.</p>
<pre>
> update-alternatives --display javac
<label class="comment"># You should see something like the following.</label>
javac - auto mode
link currently points to /usr/lib/jvm/java-7-openjdk-amd64/bin/javac
/usr/lib/jvm/java-6-openjdk-amd64/bin/javac - priority 1061
slave javac.1.gz: /usr/lib/jvm/java-6-openjdk-amd64/man/man1/javac.1.gz
/usr/lib/jvm/java-7-openjdk-amd64/bin/javac - priority 1100
Current 'best' version is '/usr/lib/jvm/java-7-openjdk-amd64/bin/javac'.
</pre>
<p>Enjoy Java 7 again! Don't forget to change the default JVM path in Eclipse, though.</p>
<p class="notes">What about the man pages installed as slaves? That part is left as an exercise :-)</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-29360496141381159452012-03-08T15:11:00.002+05:302012-03-17T05:07:52.203+05:30Convergent synthesis, finally!<div>
<p>The legacy C++ version of my chemistry product has finally gained the capability to do convergent synthesis — in a primitive form for now.</p>
<h3>What is convergent synthesis?</h3>
<p>Consider a complex molecule that we wish to synthesise. In the usual method, the synthesis steps proceed linearly. Suppose that the following is the sequence of synthesis steps, where <code>Goal</code> designates the molecule to synthesise.</p>
<pre>
A → B → C → D → E → F → G → H → I → J → Goal
</pre>
<p>The above sequence has 10 steps. We start with a simple, available molecule <code>A</code>. Presumably, a functional group is either added, replaced or deleted at each step. [Reality includes several more dimensions, but let us keep the discussion simple.] While easy to comprehend, the biggest problem with this approach is the effective yield of the route. For the purposes of discussion, let us assume an average yield of 85% per step. With 10 steps, the effective yield is less than 20%!</p>
<p>Contrast this with the following scheme.</p>
<pre>
P → Q → R ⌉
| → Goal
S → T → U ⌋
</pre>
<p>In this case, we have two independent paths leading to moderately complex molecules <code>R</code> and <code>U</code>. Then, these two paths <b><i>converge</i></b> to give rise to a more complex molecule, which in this case is our goal molecule. The expectation is that since <code>R</code> and <code>U</code> are only moderately complex, they can be independently synthesised in a couple of steps each. The effective yield for the convergent route, then, is about 44%! This is, obviously, much more attractive.</p>
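<p>Both figures are easy to verify with a quick calculation. The sketch below multiplies the 85% per-step yield across the 10 linear steps, and across the 5 total steps of the convergent scheme (counting both branches and the final convergent step, as above):</p>

```shell
#!/usr/bin/env sh
# Effective yields at an average of 85% per step.
awk 'BEGIN {
    printf "linear, 10 steps:    %.1f%%\n", 0.85^10 * 100
    printf "convergent, 5 steps: %.1f%%\n", 0.85^5  * 100
}'
# prints:
# linear, 10 steps:    19.7%
# convergent, 5 steps: 44.4%
```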
<h3>How does it work in retro-synthesis?</h3>
<p>My product actually does retro-synthesis, <i>i.e.</i>, it starts with the goal molecule, and constructs the steps in reverse order. At each step, the product molecule is broken into reactants. In most scenarios, the coreactant is a trivial molecule, or is directly available for purchase from a company like Sigma-Aldrich.</p>
<p>If we wish to take advantage of convergent synthesis, on the other hand, how we break a product molecule into possible sets of reactants becomes a matter of extreme significance. In the step <code>R + U → Goal</code>, effective convergence is possible if and only if <code>R</code> and <code>U</code> have about the same degree of complexity.</p>
<h3>What next?</h3>
<p>The immediate challenge is to locate such reactions, and build a repertoire of them. Another, of course, is to be able to break a product molecule down in a way that exploits one of them! Our chances depend on being able to identify a reasonably <i>central</i> atom that is suitable for initiating the cleavage of the product molecule. The algorithms have to be refined to exploit this new capability, as well!</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-46180450195413252382012-03-07T14:05:00.000+05:302012-03-20T08:29:33.613+05:30Difficult decision - 2<div>
<p>Firstly, I am surprised at the amount of traffic my previous post has generated. I had over a thousand visitors within a few hours: redirected from Google, Reddit, Hacker News, etc.! I wonder how many of them would have read the post had it been the opposite way, <i>i.e.,</i> a declaration that I was <b><i>adopting</i></b> Go for the next version of my chemistry product? It appears to demonstrate an intriguing aspect of human nature!</p>
<h3>Plug-in</h3>
<p>Several people have suggested various ways in which out-of-process plug-ins are better. Some of these suggestions probably arose because I did not describe a <i>plug-in</i> adequately. I tried to remedy the situation in individual responses, and am collecting some of those points below.</p>
<p>I am looking at plug-ins for at least the following benefits. In all cases, the said plug-in could potentially be supplied by me, a third-party developer or the user herself.</p>
<ul>
<li>Substitute an algorithm for another.</li>
<li>Substitute an algorithm implementation for another.</li>
<li>Add a new algorithm not originally shipped with the product.</li>
<li>Add a calculator or a transformer as a hook in a particular processing step.</li>
</ul>
<h3>Performance</h3>
<p>My application has to process millions (sometimes tens of millions) of molecules per run. A large number of them are processed by the proposed plug-ins. The number decreases with each stage of processing, owing to elimination at every stage. The load is, hence, lower towards the tail of the work flow; but upstream plug-ins are invoked for most molecules.</p>
<p>An out-of-process plug-in will require the following steps for communication:</p>
<ul>
<li>serialisation of input in the main program,</li>
<li>deserialisation of input in the plug-in,</li>
<li>serialisation of output in the plug-in, and</li>
<li>deserialisation of output in the main program.</li>
</ul>
<p>The above steps are in addition to the unavoidable protocol handshake for each request. Evidently, the larger the amount of data exchanged between the main program and the plug-in, the slower the above process will be. Let us look at the data structure that will be exchanged most often in my application, <i>viz.,</i> <code>Molecule</code>. It has the following information, at a minimum:</p>
<ul>
<li>a unique identifier,</li>
<li>a list of atoms, where each atom has:
<ul>
<li>a unique identifier,</li>
<li>element type,</li>
<li>spatial coordinates,</li>
<li>net charge,</li>
<li>stereo configuration,</li>
<li>number of implicit H atoms attached to it,</li>
<li>aromaticity,</li>
<li>list of rings it is a part of,</li>
<li>whether it is a bridgehead,</li>
<li>…</li>
</ul>
</li>
<li>a list of bonds, where each bond has:
<ul>
<li>a unique identifier,</li>
<li>the atoms it joins,</li>
<li>order: single, double, triple, aromatic, …,</li>
<li>stereo configuration,</li>
<li>aromaticity,</li>
<li>list of rings it is a part of,</li>
<li>…</li>
</ul>
</li>
<li>a list of rings with their own properties,</li>
<li>a list of components,</li>
<li>a list of functional groups,</li>
<li>… .</li>
</ul>
<p>It may be possible to have a protocol that allows a plug-in to declare the subset of the above data that it actually needs. However, checking the protocol and selectively serialising the data has its own cost.</p>
<p>On the contrary, an in-memory plug-in accesses the object using a pointer, with just a transfer of ownership but no transfer of data.</p>
<h3>Manageability</h3>
<p>External plug-ins also raise the subject of their life cycle management and resolution. The questions that need to be addressed include the following.</p>
<ul>
<li>When is a plug-in activated? Together with the main program? On demand?</li>
<li>Should plug-ins die with the main program? How do we handle the main program terminating abnormally?</li>
<li>How long should a plug-in continue to run, if idle?</li>
<li>How should zombie plug-ins be handled?</li>
<li>If socket-based, how should port numbers be managed?</li>
<li>How should multiple plug-ins providing the same capability be resolved? How about versions?</li>
</ul>
<p>While in-memory plug-ins do not automatically solve all the above, they do eliminate some of them easily.</p>
<h3>Interesting Options Suggested</h3>
<p>An interesting option that was suggested was to package the Go binary distribution as part of my product. Then, when a plug-in is downloaded, the main program itself could be re-compiled and re-linked to include the plug-in. This is a possibility. Some infrastructure code has to be written, though.</p>
<p>Another family of suggestions relates to embedding a scripting language. This is another possibility. However, it is neither fair nor acceptable to force third-party plug-ins to always suffer the (relatively) lower performance of the scripting language itself. This may become very important if the plug-in is intended for use in the initial stages of the work flow.</p>
<h3>The Need for a Java API</h3>
<p>Independent of the above, there is no straight-forward way to provide a Java API on the top of an application written in Go. This issue remains unaddressed.</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-43734659070604965852012-03-01T22:24:00.000+05:302014-01-07T22:43:43.697+05:30Difficult decision<div>
<p><b><i>Considering the number of hits this post continues to have on a weekly basis, I should point out that a follow-up post can be found at <a href="http://oneofmanyworlds.blogspot.com/2012/03/difficult-decision-2.html">A Difficult Decision - 2</a>.</i></b></p>
<p>I like Go as a language. It is a rare, good balance between simplicity, power and expressiveness. I was so impressed with it as to begin the development of the next version of my chemistry product in it. There were minor hiccups, but I made progress, and wrote as many as about 12,000 lines of code (includes comments).</p>
<p>Then, I paused.</p>
<p>Now, about two months later, I have decided to shift the development of the new version of my product to … <b><i>Java</i></b>. If that sounds anti-climactic, so it is!</p>
<p>There are two significant reasons for this decision.</p>
<h2>Plug-ins</h2>
<p>The intended users of my product (and its API) are organic chemists and cheminformatics programmers. In practice, several organic chemists neither program themselves nor have programming assistants.</p>
<p>My product has several basic capabilities built into it. These are exposed as callable or configurable algorithms. What I also want to do is enable the user to extend the capabilities of the product through plug-ins.</p>
<p>In the Java land, this is a well-studied problem, with more than one established way of solving it. In the Go scenario, on the other hand, it is not straight-forward. Go programs are linked statically, so there are prerequisites for enabling plug-ins.</p>
<h3>What would I need to do?</h3>
<ol>
<li>Supply the statically-linked full product.</li>
<li>Supply <code>*.a</code> files of my product's packages, and <code>*.o</code> files of the driver part.</li>
</ol>
<h3>What would my customer need to do?</h3>
<ol>
<li>Install <code>gcc</code> and family and their dependencies. [Once; <b>update: not needed after Go1. Thanks Andrew Gerrand!</b>]</li>
<li>Install (and maintain) the Go development environment. [Once, at least after Go1.]</li>
<li>Download the plug-in package file <code>.a</code>, and place it in a designated directory.</li>
<li>Build and install the application.<a name="dif-dec-fn1-ref" href="#dif-dec-fn1">[1]</a></li>
</ol>
<h3>Why will this not work?</h3>
<p>It could, with programmers. It, unfortunately, will not, with chemists. For many chemists, the exertion of supplying input to a computer program and reading its output using a suitable viewer is a considerable condescension. Should I ask them to compile code, and that too for each plug-in, typical chemists will laugh me out of court!</p>
<h2>The API Consumption Issue</h2>
<p>Most <i>foot soldier</i> programmers at pharmaceutical companies, biotechnology companies and informatics services providers have graduated in the last ten years, or so. Most of them learned Java as their first language, and have used it the most. Naturally, they tend to gravitate towards solutions with a published Java API.</p>
<p>Chemistry product companies acknowledge this. Almost all of them provide a Java API for their products and libraries.<a name="dif-dec-fn2-ref" href="#dif-dec-fn2">[2]</a> Those that have remained with a non-Java API are certainly <i>not</i> the leaders today!</p>
<p>Therefore, for me to appeal to those programmers, I must provide a fully capable Java API. My prospects of commercial success depend on that factor in no minor way.</p>
<h2>Conclusion</h2>
<p>Accordingly, I have shifted the product to Java. My team and I have begun re-designing and re-implementing the legacy product — this time to suit a more Java-like style; to appeal to a Java audience!</p>
<br />
<hr />
<p id="dif-dec-fn1" class="notes">[1] <code>goinstall</code> does <b>not</b> help, since it needs source code to be available. Plug-in authors may not be willing to publish source code.<a href="#dif-dec-fn1-ref">↩</a></p>
<p id="dif-dec-fn2" class="notes">[2] If they cannot provide a proper Java API, they at least provide a JNI wrapper.<a href="#dif-dec-fn2-ref">↩</a></p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com5tag:blogger.com,1999:blog-4892209359042715256.post-75655172826889059262012-02-27T19:29:00.000+05:302012-03-17T05:08:58.762+05:30Debian Wheezy : Removing unused locales and translations<div>
<h3>Removing Locales</h3>
<p>Debian includes several locales and translations in its default installation. Typically, we need very few of them, mostly just one! Here is how I removed unused locales and translations from my Debian Wheezy installation. This should apply to Sid too, but I have not verified it. The following steps should be executed as <i>root</i>.</p>
<ul>
<li>Issue the command <code>locale -a -v</code> to see a full list of the locales currently installed. Should this list already match your requirement, you can skip the remainder of this section.</li>
<li>Edit (or create) the file <code>/etc/default/locale</code>. Include the entries for your desired locale. My file looks as follows.
<pre>
# File generated by update-locale
LANG=en_IN
LANGUAGE="en_IN:en"
</pre>
</li>
<li>Remove all the files in the directory <code>/usr/lib/locale</code>.</li>
<li>Now, regenerate the locales based on your default configuration.</li>
</ul>
<p>Here is the sequence of steps.</p>
<pre>
> locale -a -v
<label class="comment"># Review the current locales.</label>
> vi /etc/default/locale
<label class="comment"># Edit the contents to suit your requirement, and save the file.</label>
> cd /usr/lib/locale
> rm -fr *
> locale-gen
</pre>
<h3>Removing Translations</h3>
<p>Unneeded translations consume disk space, network bandwidth (when updating or upgrading), and can potentially make <code>glibc</code> larger. To remove unused translations, execute the following steps as <i>root</i>.</p>
<pre>
> cd /etc/apt/apt.conf.d
> touch 99translations
<label class="comment"># Edit the file to match the following content.</label>
> cat 99translations
Acquire::Languages "none";
<label class="comment"># We have to remove already indexed translations.</label>
> cd /var/lib/apt/lists
> rm -f *Translation*
</pre>
<p>If you want to be sure, you can reboot the system. Now, when you <code>apt-get update</code> or <code>apt-get upgrade</code>, you should no longer have unused translations checked for, updated or downloaded.</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com3tag:blogger.com,1999:blog-4892209359042715256.post-89187116324141575382012-01-09T14:08:00.001+05:302012-03-17T05:09:17.492+05:30Debian Wheezy : Running in VMware Fusion 4.1<div>
<p><b>Note:</b> The following applies to Debian Sid as well.</p>
<p>Here is some information on how I could get Debian Wheezy to run successfully in VMware Fusion 4.1.</p>
<p>The installation itself was uneventful. Once Wheezy was installed, I proceeded to unpack VMware Tools, and run the installation script. However, the version of VMware Tools that comes with VMware Fusion 4.1 does not work properly with kernel 3.2.x.</p>
<p>After a few failed attempts at patching the code and trying again, I uninstalled VMware Tools. I decided to try <i>open-vm-tools</i> instead.</p>
<p>Here is the sequence of steps that I executed (as <i>root</i>, of course).</p>
<ul>
<li><code>apt-get update</code></li>
<li><code>apt-get install build-essential</code></li>
<li><code>apt-get install open-vm-tools</code></li>
</ul>
<p>You will see some errors about modules not being available. It is normal. Please ignore them, and proceed.</p>
<ul>
<li><code>apt-get install open-vm-dkms</code></li>
</ul>
<p>The above should install <i>dkms</i> itself as a dependency. By the way, <i>dkms</i> is a system that can install dynamically loadable kernel modules. The above package contains the source code of the Open VM Tools kernel modules.</p>
<ul>
<li><code>cd /usr/src</code></li>
<li><code>ls</code></li>
</ul>
<p>You should see a directory for <code>open-vm-tools</code>. Please note the version number, which is the part after <code>open-vm-tools-</code>. Now, issue the command</p>
<ul>
<li><code>dkms install open-vm-tools/&lt;version&gt;</code></li>
</ul>
<p>where you should substitute the version number that you found out above. That installs the necessary kernel modules.</p>
<p>Next, we need an appropriate entry in <code>/etc/fstab</code> for your host Mac's share. An example entry is shown here.</p>
<pre>
.host:/mac /mnt/hgfs/mac vmhgfs defaults 0 1
</pre>
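<p>Incidentally, looking up the <code>dkms</code> version number by hand can be scripted: derive it from the source directory name. The sketch below is demonstrated on a scratch directory with a made-up version number, so that it runs anywhere; on a real system, you would point it at <code>/usr/src</code>.</p>

```shell
#!/usr/bin/env sh
# Derive the 'dkms install' argument from the source directory name.
# /tmp/src-demo and the version 2011.12.20 are made up for this demo;
# on a real system, list /usr/src/open-vm-tools-* instead.
mkdir -p /tmp/src-demo/open-vm-tools-2011.12.20
dir=$(ls -d /tmp/src-demo/open-vm-tools-* | head -n 1)
ver=${dir##*open-vm-tools-}
echo "dkms install open-vm-tools/$ver"   # prints: dkms install open-vm-tools/2011.12.20
rm -r /tmp/src-demo
```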
<p>For X automatic re-sizing and copy-and-paste between OS X and Wheezy, you have to install one last package.</p>
<ul>
<li><code>apt-get install open-vm-toolbox</code></li>
</ul>
<p>That is it. Now, reboot the virtual image, and enjoy your Debian Wheezy with better host integration!</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com1tag:blogger.com,1999:blog-4892209359042715256.post-89676970386059031772012-01-07T10:42:00.000+05:302012-03-17T05:10:03.168+05:30Braces, scopes and compilers<div>
<p>I ran into an interesting problem with my legacy chemistry application (<b>not</b> the new one). We have over 5,700 tests for the application. Each test includes a molecule that characterises a family of products and another of reactants. Outside those tests, we use several <i>standard</i> – as in drugs – molecules to test the capabilities of the program.</p>
<p>For some time now, due to an error in the test scripts driver, the tests have not actually been run. The driver has falsely been reporting success for each test. Two days ago, I noticed that, and corrected the driver script. I then ran the corrected script. 5,743 of the 5,744 tests passed.</p>
<p>Test <i>m5212</i>, the failing molecule, was an interesting case. After some debugging, I localised the error to a file <i>SRC/map.cpp</i>, function <i>initialmaps()</i>. Here is the code segment in question.</p>
<pre>
4533: if (ignoretop)
4534:   for (j = 1; j <= pmol.numat; ++j)
4535:     if ((uniqueinmol(pmol,j) || (pmol.at[j].top == j))
4536:         &&
4537:         (1 != pmol.at[j].unsat))
4538:       for (g = 1; g <= mol1.numat; ++g)
4539:         if ((uniqueinmol(mol1,g) || mol1.at[g].top)
4540:             &&
4541:             ((pmol.at[j].attribute == mol1.at[g].attribute) ||
4542:              ((pmol.at[j].atnum == mol1.at[g].atnum) &&
4543:               pmol.at[j].heteroaromatic &&
4544:               mol1.at[g].heteroaromatic) ||
4545:              tautomeric(pmol,mol1,j,g)))
4546:           if (fit(pmol,tempmap,j,g))
4547:           {
4548:             tempmap.addapair(pmol,j,g);
4549:             nmap.AddToTail(tempmap);
4550:             check_nummap_limit();
4551:             tempmap.clear();
4552:           }
4553: else
4554:   for (j = 1; j <= pmol.numat; ++j)
4555:     if ((uniqueinmol(pmol,j) ||
4556:          (pmol.at[j].top == j) ||
4557:          ((6 != pmol.at[j].atnum) && !pmol.at[j].top) ||
4558:          ((gring = atring.ringcontaining(pmol,j)) &&
4559:           pmol.ringlist[gring].present(pmol.at[j].top)))
4560:         &&
4561:         (1 != pmol.at[j].unsat))
4562:       for (g = 1; g <= mol1.numat; ++g)
4563:         if ((uniqueinmol(mol1,g) ||
4564:              (mol1.at[g].top == g) ||
4565:              ((rring = atring.ringcontaining(mol1,g)) &&
4566:               (mol1.ringlist[rring].present(mol1.at[g].top))))
4567:             &&
4568:             ((pmol.at[j].attribute == mol1.at[g].attribute) ||
4569:              ((pmol.at[j].atnum == mol1.at[g].atnum) &&
4570:               pmol.at[j].heteroaromatic && mol1.at[g].heteroaromatic)
4571:              || tautomeric(pmol,mol1,j,g))) // ex. MOL/m2944
4572:           if (fit(pmol,tempmap,j,g))
4573:           {
4574:             tempmap.addapair(pmol,j,g);
4575:             nmap.AddToTail(tempmap);
4576:             check_nummap_limit();
4577:             tempmap.clear();
4578:           }
</pre>
<p>Experienced programmers must have already realised the issue. For the novices, here is a little explanation. From the indentation, it appears as if the programmer's intention was to have the <i>else</i> on line 4553 act in conjunction with the <i>if</i> on line 4533. Unfortunately, though, C++ does <b><i>not</i></b> look at the code that way. It matches the <i>else</i> on line 4553 with the most recent unbalanced <i>if</i>. And, such an <i>if</i> occurs on line 4546. Here is how re-indented code looks.</p>
<pre>
4533: if (ignoretop)
4534:   for (j = 1; j <= pmol.numat; ++j)
4535:     if ((uniqueinmol(pmol,j) || (pmol.at[j].top == j))
4536:         &&
4537:         (1 != pmol.at[j].unsat))
4538:       for (g = 1; g <= mol1.numat; ++g)
4539:         if ((uniqueinmol(mol1,g) || mol1.at[g].top)
4540:             &&
4541:             ((pmol.at[j].attribute == mol1.at[g].attribute) ||
4542:              ((pmol.at[j].atnum == mol1.at[g].atnum) &&
4543:               pmol.at[j].heteroaromatic &&
4544:               mol1.at[g].heteroaromatic) ||
4545:              tautomeric(pmol,mol1,j,g)))
4546:           if (fit(pmol,tempmap,j,g))
4547:           {
4548:             tempmap.addapair(pmol,j,g);
4549:             nmap.AddToTail(tempmap);
4550:             check_nummap_limit();
4551:             tempmap.clear();
4552:           }
4553:           else
4554:             for (j = 1; j <= pmol.numat; ++j)
4555:               if ((uniqueinmol(pmol,j) ||
4556:                    (pmol.at[j].top == j) ||
4557:                    ((6 != pmol.at[j].atnum) && !pmol.at[j].top) ||
4558:                    ((gring = atring.ringcontaining(pmol,j)) &&
4559:                     pmol.ringlist[gring].present(pmol.at[j].top)))
4560:                   &&
4561:                   (1 != pmol.at[j].unsat))
4562:                 for (g = 1; g <= mol1.numat; ++g)
4563:                   if ((uniqueinmol(mol1,g) ||
4564:                        (mol1.at[g].top == g) ||
4565:                        ((rring = atring.ringcontaining(mol1,g)) &&
4566:                         (mol1.ringlist[rring].present(mol1.at[g].top))))
4567:                       &&
4568:                       ((pmol.at[j].attribute == mol1.at[g].attribute) ||
4569:                        ((pmol.at[j].atnum == mol1.at[g].atnum) &&
4570:                         pmol.at[j].heteroaromatic && mol1.at[g].heteroaromatic)
4571:                        || tautomeric(pmol,mol1,j,g))) // ex. MOL/m2944
4572:                     if (fit(pmol,tempmap,j,g))
4573:                     {
4574:                       tempmap.addapair(pmol,j,g);
4575:                       nmap.AddToTail(tempmap);
4576:                       check_nummap_limit();
4577:                       tempmap.clear();
4578:                     }
</pre>
<p>How nice! This disconnect between the programmer's intention and C++'s understanding of the code is – evidently – caused by the absence of braces <i>{</i> and <i>}</i> in appropriate places to delimit the scopes. For that reason – and to aid my own comprehension – I always use braces, even when a scope has only one statement.</p>
<p>Now, if I insert a <i>{</i> at the end of line 4533 and a <i>}</i> before the <i>else</i> on line 4553, test <i>m5212</i> passes. <b><i>But</i></b>, four other tests, all of which pass otherwise, fail! They are <i>m3651</i>, <i>m3747</i>, <i>m4750</i> and <i>m5211</i>. Sigh! This needs a deeper investigation.</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com2tag:blogger.com,1999:blog-4892209359042715256.post-2297463149989027202011-12-25T10:26:00.000+05:302011-12-25T10:26:38.365+05:30Linux distribution trials have not ended :-(<div>
<p>Following up on my August post regarding Linux distribution trials, I could not – in fact – continue with CentOS. There are two important reasons.</p>
<ul>
<li><i>yum</i> is a really fragile tool for package management. Not only could I not understand what I had done wrong to break it, but following its instructions on how to recover a sane package database state just never worked! It kept complaining endlessly that Update Manager was keeping the database locked, while Update Manager kept complaining that a different package management tool was keeping the database locked. I followed resolution suggestions from about a dozen Web sites, with no success whatsoever. Even deleting the entire package database did not help! This paints such a sorry picture of Linux!</li>
<li>CentOS makes it really difficult to install a broad set of essential 32-bit libraries. Firstly, they are not available by default in the 64-bit repositories under a single <i>compat-</i>style package; at least, I could not find one easily. Hence, I had to let a 32-bit application such as Skype or TeamViewer fail once for each dependency, figure out which package provided that library, and then install its <i>.i386/.i686</i> version. This is tedious at best.</li>
</ul>
<p>Consequently, I moved on. My current experiment is with openSUSE 12.1 KDE spin. I have returned to openSUSE after almost five years. I do not remember if <i>zypper</i> was the default package management tool back then. After going through the man pages, and a few trials, I got used to it. I tried to break it in one or two small ways, but it has withstood so far!</p>
<p>The OS itself has been quite stable in these several weeks of use. I disabled <i>nepomuk</i> and <i>strigi</i>, and left KDE notifications on. The CPU usage when idle is good (<= 1.5%), KDE has been very responsive and stable, and to top it all, openSUSE 12.1 came with Go r60.3 in the repositories!</p>
<p>I have moved one of my developers to it. In its default configuration, openSUSE does not allow non-root users to connect to a wireless network. That is the only complaint that I have received yet. I resolved it by making my trusted wireless network a system connection.</p>
<p>Thus, life looks good so far! I shall update the status some time towards the end of January 2012 again.</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-47836375687486804442011-12-24T11:19:00.000+05:302011-12-25T08:31:15.550+05:30No more Apple products!<div>
<p>No more Apple products! I have made that decision after a careful examination of several incidents.</p>
<p>For the record, I have a MacBook Pro 15" and an iPod Touch (second generation) for my wife, and a MacBook Pro 17" for myself. Now, I think that I should not have purchased them in the first place.</p>
<h2>Why did I buy them, then?</h2>
<p>I bought the iPod Touch because I thought that it would be useful for quick e-mail checks and responses. That it could carry my music was secondary; more like tertiary, since I browsed the Internet using it much more than I listened to the music in it.</p>
<p>When I first bought the 15" MacBook Pro, my primary operating system for well over a decade had been Linux. I purchased a Mac Mini for cross-browser compatibility testing at work. I was bewitched by the cuteness of the user interface, even though I was at variance with some of its UX concepts. When I discovered the command line and the POSIX layer beneath the UI, though, it appeared as if my wishes were answered. So, I went ahead and ordered the MacBook Pro.</p>
<p>The first alarm bells did ring when I placed the order. The price was unreasonably high. Worse, the India prices were at close to a 60% premium over the corresponding US prices. But the charm worked, and I parted with the money.</p>
<p>When the computer did arrive, I was unhappy to see that it came with US power socket pins and no Indian pin adapter. I had to purchase an adapter myself.</p>
<p>Over a period, my usage of the iPod Touch decreased (more about this later), and I gave it to my wife so she could use it for listening to music.</p>
<h2>Trouble strikes</h2>
<p>A few months later, the battery of the MacBook Pro failed rapidly. After trying to recondition it by draining it fully, I realized that it was to no avail. I went to the local Apple reseller. The genius there examined my computer, and declared that the battery was dead and needed to be replaced. And, with a rueful countenance, he further declared that it had just run out of warranty. I was perplexed. I showed the system diagnostics to him, pointing out that the battery had gone through only 310 recharge cycles. With a superior smile, he advised me that the typical life of laptop batteries was only 300 cycles! I was now disappointed. With an effort at being civil, I told him that the computer in context was my fourth or fifth, and that I had enough experience to know that laptop batteries do not die so fast. He simply ignored me, and asked me if I wanted a new battery or not. By now, I was enraged. However, I had no choice. So, I paid a ridiculously high price - the Apple premium - for a new battery.</p>
<h2>What about the other MacBook Pro?</h2>
<p>Technically, <i>I</i> did not purchase it. One of my collaborators - who lives in Toronto, Canada - did. His primary operating system had been Linux, too, for an even longer period. When I assured him that he could continue to run his familiar shell and <i>gcc</i> in MacOS X, he took the plunge. He purchased a 17" MacBook Pro. However, between the unfamiliar interface, the default behavior of Finder, and the inability to easily install new packages (compared to <i>yum install package</i>), his ride was rough.</p>
<p>After some research, I suggested <i>macports</i> to him. However, he ran into nuisances when some libraries installed through that (as dependencies) began conflicting with those installed by <i>Xcode</i>. After a while, he gave up, and simply shipped it to me. And, he did it without warning me!</p>
<h2>Trouble strikes, again!</h2>
<p>Two days before the 17" computer arrived, I had gone to the local Apple reseller, and purchased an OS upgrade for my 15" computer, from Leopard to Snow Leopard.</p>
<p>My gentle collaborator forgot to send the power adapter when he shipped the 17" computer. So, I needed to purchase one if I wanted to use the computer. Reluctantly, I went to the Apple reseller, again. Once again did I have to pay the Apple premium, this time for the power adapter. I opened the pack, and this time, I found that it had UK power socket pins! That was ridiculous! I asked the reseller why I was given an adapter with UK pins. He shrugged, and replied that others simply purchased an additional converter. And that he could sell me one if I wished to buy it! I was getting indignant.</p>
<p>Nevertheless, I showed him the Snow Leopard upgrade that I had purchased two days earlier. It was new and unopened. I asked him to take it back, and give me a family pack instead, since I now had two computers to upgrade. He flatly refused. I raised the matter with the store manager, urging him to be reasonable. He seconded his genius, and said that it was against Apple policies. If I wished to upgrade the second computer, I had to purchase another single upgrade copy. I was livid!</p>
<h2>Meanwhile, why did I stop using the iPod Touch?</h2>
<p>The biggest reason was that I could not watch technology talks in it. Why? Because it cannot play Flash content. Another problem was that its Exchange ActiveSync did not allow me to install a self-signed certificate, barring me from accessing my corporate e-mail in it.</p>
<h2>Conclusions</h2>
<ul>
<li>Apple products are needlessly expensive. And, on top of that, there is no guarantee that this expensive hardware is as durable as much less expensive hardware from other vendors.</li>
<li>Apple is an arrogant company. It has no respect for its customers. It has sold equipment to me, in India, with US and UK power socket pins, forcing me to spend additional money on adapters/converters.</li>
<li>Apple's policies work against its customers. They did not exchange the brand new, unopened Snow Leopard upgrade pack for a family pack.</li>
</ul>
<p>I am not going to purchase another Apple product! Not until all of the above changes. Today, I spend almost all of my time on the computer inside a Linux VMWare Fusion image. And, yes, I have <b>not</b> upgraded Mac OS X.</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-42055799337053144972011-11-11T19:36:00.001+05:302012-05-06T16:10:45.479+05:30Draft 2 of a proposal for generics in Go<div class="numstart">
<p>
<b>N.B.</b> This is draft version 2 of a work in progress. It is intended to help me understand the subject better, elicit comments from more knowledgeable people, and elucidate the nuances of generics in Go's context. The following proposal makes use of a syntax for ease of presentation. The syntax is, however, not intended to be the primary subject, nor is it intended to represent the complexities or possible complications of the underlying mechanisms.
</p>
<h2 class="num">Generics</h2>
<p>
In Go, there are fundamental differences between built-in types and user-defined types. When we try to come up with mechanisms that help treat both of them on a completely equal footing, they seem to stretch the language along difficult dimensions. Particularly troublesome are issues such as the following.
</p>
<ul>
<li>
Built-in types (like <code>int</code> and <code>float64</code>) do not have methods; they only have a pre-defined set of operators. Users cannot redefine those operators, nor can they define new operators.
</li>
<li>
Operator overloading is not allowed, preventing user-defined types from exhibiting operator compatibility with the built-in types.
</li>
<li>
Operators are not methods themselves, and hence do not allow a symmetry between the built-in operations and the user-defined.
</li>
</ul>
<p>
Rather than explore possible solutions to the above, the current proposal accepts them as facts that will not change. In other words, the inherent differences between built-in types and user-defined types are preserved, and no unification is attempted. By adding one new concept (that of <label class="def">type classes</label>, see hereunder) to the language, backwards-compatibility can be maintained even as generics are introduced into the language.
</p>
<p>
This proposal considers the following points as the base requirements for a working model of generics in Go.
</p>
<ol>
<li>Enabling generics for built-in types as well as user-defined types.</li>
<li>Avoiding boxing/unboxing and implicit casts.</li>
<li>Preserving Go's (package-level) separate compilation model.</li>
<li>Maintaining reasonably high compilation speed.</li>
<li>Promoting (generated) code sharing.</li>
<li>Allowing for reflection of generics.</li>
</ol>
<h3 class="num">Storage Classes</h3>
<p>
Types <code>T1</code> and <code>T2</code> belong to the same <label class="def">storage class</label> if elements of <code>T1</code> can either be assigned to those of <code>T2</code>, or converted to <code>T2</code>, and <i>vice versa</i>. Please note that only conversion is considered, not a type assertion.
</p>
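As a concrete illustration of mutual convertibility, consider <code>int32</code> and <code>int64</code> in present-day Go. This is a minimal runnable sketch of the idea, not an example of the proposal's syntax:

```go
package main

import "fmt"

func main() {
	// int32 and int64 values convert to each other explicitly,
	// so, by the definition above, they belong to the same storage class.
	var a int32 = 42
	b := int64(a) // int32 -> int64 conversion.
	c := int32(b) // ...and the reverse conversion.
	fmt.Println(b, c)
}
```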
<h3 class="num">Type Classes</h3>
<p>
Types <code>T1</code> and <code>T2</code> belong to the same <label class="def">type class</label> if the following are true:
</p>
<ol>
<li>
<code>T1</code> and <code>T2</code> belong to the same storage class and
</li>
<li>
every operation defined on elements of <code>T1</code> is also defined on those of <code>T2</code>, and <i>vice versa</i>.
</li>
</ol>
<p>
As per the above definition, for instance, <code>int32</code> and <code>int64</code> belong to the same type class, while <code>int64</code> and <code>complex64</code> do not. Thus, type classes play the equivalent role of interfaces for built-in types. However, type classes do not introduce new concrete types.
</p>
<h3 class="num">Predefined Type Classes</h3>
<p>
The Go language defines a set of predefined type classes for numeric types, as enumerated here.
</p>
<ul>
<li>
<b>Integer</b> This type class encompasses the following built-in types: <code>int</code>, <code>int8</code>, <code>int16</code>, <code>int32</code>, <code>int64</code>, <code>uint</code>, <code>uint8</code>, <code>uint16</code>, <code>uint32</code>, <code>uint64</code> and <code>byte</code>.
</li>
<li>
<b>Float</b> This type class encompasses the following built-in types: <code>float32</code> and <code>float64</code>.
</li>
<li>
<b>Real</b> This type class is the union of the type classes <code>Integer</code> and <code>Float</code> above.
</li>
<li>
<b>Complex</b> This type class encompasses the following built-in types: <code>complex64</code> and <code>complex128</code>.
</li>
</ul>
<p>
In addition, the set of all named interface types in a program constitutes a single type class. This class is unnamed, and is not accessible to the users; it is defined solely for the purposes of implementation.
</p>
<p>
It is not possible to define new type classes.
</p>
<p class="notes">
Inability to define new type classes may be a limitation. However, my current thinking is that the capability to define them may not be a pressing requirement. In any case, such a capability cannot be provided until the mechanisms governing the composition of type classes are unambiguously understood.
</p>
<h3 class="num">Generic Types</h3>
<p>
A type <code>G</code> with a type parameter <code>T</code> is said to be <label class="def">generic</label> in <code>T</code>. It is denoted by <code>G<T></code>. It is allowed for a generic type to be generic in more than one type. Where multiple type parameters need to be specified together, they are comma-separated. Go's generic types are reified, and are amenable to introspection at run time.
</p>
<p>
A type argument <code>T</code> can be any <i>valid</i>, <i>assignable</i> (within the package scope of applicability) Go type, with the exception of <code>interface{}</code>. The semantics of <code>interface{}</code> interfere with the compile-time checks that generics require. Moreover, the choice of <code>interface{}</code> for a type expressly avoids compile-time specification of a concrete type on the part of the user. In a sense, the compile-time mechanisms of generics constitute the conjugate space of that constituted by the run-time mechanisms of <code>interface{}</code>.
</p>
<p>
The semantics of generic types are dependent on the type parameter constraints (described hereunder) and actual type arguments; in particular, value types and reference types lead to different semantics.
</p>
<p>
A type parameter <code>T</code> can appear in the following positions (or sites):
</p>
<ol>
<li>
type parameter declaration,
</li>
<li>
variable type specification,
</li>
<li>
function/method parameter type specification and
</li>
<li>
allocation.
</li>
</ol>
<h4 class="num">Type Parameter Declaration</h4>
<p>
Type parameter declaration makes a new type parameter available for subsequent use within the applicable scope. Such a declaration does <i>not</i> introduce a new concrete type. It is illegal to declare a type parameter that is unused within that scope; similarly, it is illegal to use a type parameter that is not declared in that scope or any of its enclosing scopes. Type parameters are <i>not</i> shadowed. Hence, it is illegal to re-declare a type parameter that is already in scope.
</p>
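To illustrate these rules in the proposal's syntax (the type names here are invented for illustration):
<pre>
type Pair<A, B> struct {    <label class="comment">// Valid; A and B are both used.</label>
    first A
    second B
}
type Box<T> struct {    <label class="comment">// Invalid; T is declared but unused.</label>
    x int
}
type Bag struct {
    x T    <label class="comment">// Invalid; T is not declared in any enclosing scope.</label>
}
</pre>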
<p>
A generic type <code>G<T></code> with a specification for <code>T</code> introduces a new concrete type. This new concrete type behaves in full compliance with the rules governing normal Go types, including those for type inference, with explicit exceptions where noted.
</p>
<p>
Given a generic type <code>G</code> that is dependent on type parameters <code>T<sub>1</sub>, ..., T<sub>m</sub></code>, it is possible to construct a generic type <code>H</code> that is the same as <code>G</code> with <code>n</code> (where <code>n < m</code>) of those type parameters instantiated. This results in <code>H</code> being generic in the remaining <code>m-n</code> parameters.
</p>
<pre>
type HashTable<K, V> struct {
    ...
}
type HashSet<K> HashTable<K, bool>
</pre>
<h4 class="num">Variable Type Specification</h4>
<p>
Using a generic type as the type of a variable requires specification of concrete types for the generic type's type parameters. The number of specified type arguments should match that in the generic type's declaration. It is illegal to leave any type parameter uninstantiated, unless such parameter is in scope from an enclosing declaration.
</p>
<pre>
var h1 HashTable<string, float64> <label class="comment">// Valid.</label>
var h2 HashTable<string, T> <label class="comment">// Invalid; T is not instantiated.</label>
</pre>
<p>
It is allowed to specify a type parameter as the type of a variable, as long as the type parameter has a valid declaration in the current scope.
</p>
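For instance (the <code>Put</code> method here is invented for illustration):
<pre>
func (h *HashTable<K, V>) Put(k K, v V) {
    var old V    <label class="comment">// Valid; V is in scope from the receiver's declaration.</label>
    ...
}
</pre>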
<p>
It is illegal for a naked type parameter to itself act as a generic type. For instance, the following is illegal.
</p>
<pre>
type A<T, U> struct {
    x T<U>    <label class="comment">// Invalid.</label>
}
</pre>
<h4 class="num">Function/Method Parameter Type Specification</h4>
<p>
A function or a method can utilize type parameters available in scope. The function/method will then be generic in those type parameters utilized.
</p>
<pre>
func (h *HashTable<K, V>) At(k K) V {
    ...
}
</pre>
<p>
A function/method may itself be generic in one type parameter or more. This is independent of its being generic in one type parameter or more from an outer scope.
</p>
<pre>
func modulus<T>(v T) float64 {
    ...
}
</pre>
<h4 class="num">Allocation</h4>
<p>
In the presence of type parameters, allocation requirements may be determined at run time. The type class of the type parameter determines the allocation mechanism. Also, value types and reference types have different semantics, as described later.
</p>
<pre>
type S struct {
    ...
}
var v = new(HashTable<string, S>)
</pre>
<h3 class="num">An Example</h3>
<p>
The following example illustrates the above.
</p>
<pre>
<label class="comment">// T appears in the position of slice type specification.</label>
type Vector<T> []T
<label class="comment">// Note that T needs to be specified again in Vector<T> for it to be in scope.</label>
<label class="comment">// T appears in the position of method parameter type specification.</label>
func (v *Vector<T>) Push(el T) {
    *v = append(*v, el)
}
<label class="comment">// U is a method-specific type parameter.</label>
func (v *Vector<T>) Map<U>(mapper func (T) U) *Vector<U> {
    u := new(Vector<U>)    <label class="comment">// Allocation depends on U.</label>
    for _, el := range *v {
        u.Push(mapper(el))
    }
    return u
}
</pre>
<h3 class="num">Constrained Generic Types</h3>
<p>
Since an unconstrained generic type cannot assume anything specific about the nature of the types eventually instantiated into its type parameters, the set of possible operations on the elements of such types is very small. Constraints can improve the situation by allowing a wider set of operations.
</p>
<p>
A type parameter can optionally be constrained to only those types that satisfy a particular interface. This enables all the methods from the interface to be utilized in the context of variables belonging to the applicable type parameter.
</p>
<pre>
type Comparable<E> interface {
    Compare(E) int
}
type OrderedSet<E : Comparable<E>> struct {
    ...
}
}
</pre>
<p>
The above declaration allows ordered sets to be constructed over any type whose elements are ordered according to the <code>Compare()</code> relation.
</p>
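For example, a user-defined type satisfies the constraint by implementing <code>Compare()</code> (the <code>Version</code> type is invented for illustration):
<pre>
type Version struct {
    major, minor int
}
func (v Version) Compare(o Version) int {
    ...
}
var releases OrderedSet<Version>    <label class="comment">// Valid; Version satisfies Comparable<Version>.</label>
</pre>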
<p>
Similarly, a type parameter can optionally be constrained to only those types that belong to a particular type class.
</p>
<pre>
type Vector<T : Integer> []T <label class="comment">// A vector over all integer types.</label>
</pre>
<p>
The above declaration allows a generic vector type to be written with algorithms that provide operations for any integral type.
</p>
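Instantiations of such a vector would then be checked against the type class (hypothetical examples):
<pre>
var counts Vector<uint32>     <label class="comment">// Valid; uint32 belongs to Integer.</label>
var ratios Vector<float64>    <label class="comment">// Invalid; float64 does not belong to Integer.</label>
</pre>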
<p>
Constraints on type parameters can appear at all sites except allocation sites. A constraint on a type parameter need only be specified when it is introduced for the first time. In the example above, for instance, the methods of <code>OrderedSet</code> need not repeat the constraint on <code>E</code>.
</p>
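For instance, a method on <code>OrderedSet</code> can use <code>E</code> without restating the constraint (the <code>Add</code> method is invented for illustration):
<pre>
func (s *OrderedSet<E>) Add(el E) {    <label class="comment">// E : Comparable<E> need not be repeated.</label>
    ...
}
</pre>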
<p>
It is illegal to specify more than one constraint for a given type parameter.
</p>
<h2 class="num">Implementation Notes</h2>
<p class="notes">
<b>Caveat:</b> I do not understand the current Go compilers. Hence, the following is only a hint. More knowledgeable people should be able to comment on the naïveté, the feasibility and the complexity of implementing such a scheme as the one suggested here. I have represented some of the logic in the form of high-level pseudocode, in the hope of making it easier to discuss.
</p>
<p>
The following implementation notes cover various aspects of compilation, linking and reflection.
</p>
<h3 class="num">Compilation of Generics</h3>
<p>
In Go, compilation of a generic type generates two outputs: (a) generic type metadata and (b) partially compiled form of the generic.
</p>
<h4 class="num">Generic Type Metadata</h4>
<p>
At least the following metadata is generated for a given generic type during compilation. This information is packed into a structure, which we refer to as a <label class="def">type handle</label>, that is amenable to reflection.
</p>
<ul>
<li>unique identifier of the generic,</li>
<li>number of type parameters,</li>
<li>type handles of all type parameters,</li>
<li>function descriptors of all methods and</li>
<li>a pointer to the method table.</li>
</ul>
<p>
A similar type handle is generated for each type parameter. It contains at least the following metadata.
</p>
<ul>
<li>unique identifier of the type parameter,</li>
<li>a slot to hold the size of an instantiated type,</li>
<li>any constraint specified for it and</li>
<li>a pointer to its entry in the applicable function/method table.</li>
</ul>
<p>
In the case of functions and methods, a <label class="def">function descriptor</label> is generated. As with a type handle, this too is amenable to reflection. It contains at least the following metadata.
</p>
<ul>
<li>unique identifier of the function/method and</li>
<li>type handles of all type parameters.</li>
</ul>
<p>
It should be noted that the implementation distinguishes function- or method-specific type parameters from those declared in any applicable enclosing generic type.
</p>
<p>
If the generic type or function is exported, the above information is exported from the package of origin, and is available to the compiler, the linker and the runtime.
</p>
<h4 class="num">Partial Compilation</h4>
<p>
Each type parameter of the generic type serves as a slot that is filled for each instance of the generic type. A pointer to the metadata structure mentioned above fills this slot.
</p>
<p>
When compiling code, the size of the involved data types needs to be known. The following mechanism is used in order to arrive at one, for each type parameter.
</p>
<p>
When a generic type is compiled, the compiler first checks to see if its type parameters are constrained. For each type parameter constrained to a type class, the size of its elements is chosen to be that of the widest type (not in the variance sense, since Go does not support classical inheritance, but in the storage sense) in that class.
</p>
<p>
For each type parameter that is either constrained to an interface or unconstrained, the size is looked up from its type handle. This size is only a placeholder; it is filled, and may differ, for each instantiation of the generic type. This is applicable to all user-defined types supplied as value type arguments to any generic type.
</p>
<pre>
compileGeneric() {
    if generic is a struct or an interface {
        generate type handle metadata placeholder
    } else {    <label class="comment">// Function or method.</label>
        generate a function descriptor placeholder
    }
    for each type-parameter {
        if type-parameter is constrained {
            if constrained to a type-class {
                use the size of the widest type in class
                point to the predefined pseudo-method-set of this type-class
            } else {    <label class="comment">// Constrained to an interface.</label>
                set a mark to obtain the size from type handle
                setup, or point to, the method set of the interface
            }
        } else {
            set a mark to obtain the size from type handle
            point to a predefined empty method set
        }
    }
    type check the code as per the above type handles and method sets
    generate Go AST using the above type handles and function descriptors
    serialize the generated AST into the package binary
    set pointers for functions, methods and method tables
}
</pre>
<h4 class="notes">Notes</h4>
<ul class="notes">
<li>The choice of an AST representation of the generic's code is with the intention of preserving the separate compilation model of Go. It also relieves us of the need to invent a new exportable intermediate language; we just use a standard part of the language. The package <code>go/ast</code> may have to be enhanced to support this requirement.</li>
<li>Please note that the generic code is type checked only once, in its package of origin. For each instantiation, the supplied type arguments are checked only against any specified constraints. This speeds up compilation, since there is no need for repeated full type checks. Even though we are using an approach opposite to that of C++ (where type checks are completely use-site based), we are not going anywhere near the Hindley-Milner type system.</li>
<li>The above mechanism - as described - is not space-efficient for type parameters constrained to type classes. For instance, a vector of bytes will occupy more than a byte per byte. This is a trade-off between monomorphizing (code bloat) and user data space-efficiency. However, this could be tuned.</li>
<li>This lookup of the actual size from the corresponding type handle introduces a small run time overhead. However, it greatly facilitates code sharing. It may be possible, and even easy, to avoid this overhead through the use of some clever trick. I do not know enough!</li>
<li>I also have not thought through any possible repercussions of differently-instantiated recursive functions and methods. For instance, <code>foo<T1>(x T1)</code> may have a call to <code>foo<T2>(y)</code>, where <code>y</code> has type <code>T2</code>.</li>
</ul>
<h3 class="num">Compilation of Client Code</h3>
<p>
Once the generic type is compiled as above, client code can be written to instantiate and use it.
</p>
<h4 class="num">Metadata</h4>
<p>
When the generic type is instantiated in client code, and type parameters are specified, those type arguments are substituted in a copy of the generic's metadata. The result is a new concrete type. For most type checking purposes, the generic origins of this new concrete type can be ignored, as long as the applicable constraints are taken into account.
</p>
<p>
It should be noted that the unique identifier of the generic can be used to trace this instance back to its generic type definition. Similarly for type parameters and their constraints. This filled copy of the metadata is available to the compiler, the linker and the runtime.
</p>
<h4 class="num">Instance Code Generation</h4>
<p>
Generation of machine code for a specific instance of the generic type depends on factors such as whether a supplied type argument is a value type or not, and whether it is constrained or not. The following high-level logic could help illustrate the mechanism.
</p>
<pre>
generateClientCode() {
    create a local copy of the generic type's metadata structure
    fill the metadata with actual type arguments
    <label class="comment">// Now we have a new concrete type.</label>
    if new concrete type in package dictionary {
        <label class="comment">// Not really a new type.</label>
        register this type to use the corresponding code
        return
    }
    if new concrete type type-class-compatible with an existing type {
        <label class="comment">// We can still share code.</label>
        <label class="comment">// Also note that interface types also fit in here.</label>
        register this type to use the corresponding code
        return
    }
    read the serialized AST of the generic from the package of origin
    for each type-argument {
        type check against any applicable constraint
        if type-argument is value-type {
            handleValueType(type-argument)
        } else {
            handleReferenceType(type-argument)
        }
    }
    generate binary code from the AST, filling type information
    register new concrete type in package dictionary with pointer to generated code
}
//
handleValueType(type-argument) {
    if type-argument belongs to a built-in type-class {
        use the pre-selected (widest) element size
    } else {
        use the actual size of the given type
    }
}
handleReferenceType(type-argument) {
    use the predefined size of standard interface
}
</pre>
<p>
The package dictionary referred to is always exported, <i>i.e.,</i> it is available to the compiler, the linker and the runtime. This is independent of whether a particular generic instance is itself exported.
</p>
<p class="notes">
I currently think that the above exported information need not contain package namespace qualifiers. But, I could be wrong in thinking so, and namespace qualifiers may be necessary for some purpose.
</p>
<p>
In order to facilitate the process of looking up the size of a given value type at run time, a hidden parameter is inserted into all applicable function and method signatures. The corresponding type handle of the supplied type argument is passed into the said hidden parameter at run time. When multiple type arguments are involved, a single consolidated <i>meta handle</i> of their type handles is passed into the hidden parameter. This simplifies the internal signatures of the functions/methods.
</p>
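As a sketch of this transformation (the internal names are invented for illustration):
<pre>
<label class="comment">// As written by the user:</label>
func modulus<T>(v T) float64 { ... }
<label class="comment">// As compiled internally (hypothetical):</label>
func modulus(th *typeHandle, v value) float64 { ... }
<label class="comment">// th carries the type handle (or meta handle) for the actual argument of T.</label>
</pre>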
<h4 class="notes">Notes</h4>
<ul class="notes">
<li>Since Go does not support traditional inheritance, there is no transitive closure of types that we need to take into account. Thus, the amount of code explosion is much less than what it would be had transitive closure been applicable. It should be noted that this applies independent of whether generic metadata is used to reuse generated code or not. But, the combination of the two results in a big win over what causes code bloat otherwise.</li>
<li>A topic that is not discussed relates to partial specialization. It can be dealt with once the mechanisms governing the simpler cases are understood and established.</li>
</ul>
<h3 class="num">Linking</h3>
<p>
During linking, the generic metadata exported from various packages is checked to see if there are redundant copies of:
</p>
<ul>
<li>the same metadata in different packages and</li>
<li>the same code in different packages (library files).</li>
</ul>
<p>
A global dictionary is created by merging the above information from all packages. This global dictionary is constructed using the exported package-level dictionaries themselves. The detected redundant copies are not linked into the final executable.
</p>
<h3 class="num">Reflection Capabilities</h3>
<p>
Reflection of generic types at run time is possible. Type handles and function descriptors are utilized for this purpose. At least the following capabilities are provided through the reflection API.
</p>
<ul>
<li>Query if a type is generic.</li>
<li>Query the number of type parameters.</li>
<li>Query the number of type-level type parameters.</li>
<li>Query the number of function- or method-level type parameters.</li>
<li>Retrieve specific or all type handles.</li>
<li>Retrieve the information within a type handle or a function descriptor.</li>
</ul>
<p class="notes">
The specific form of the API can be determined later to suit the philosophy of the current reflection API.
</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com3tag:blogger.com,1999:blog-4892209359042715256.post-38686668098995881772011-11-02T21:29:00.000+05:302012-03-17T06:28:00.130+05:30A proposal for generics in Go<div>
<p>
<b>N.B.</b> This is a work in progress. It is intended to help me understand the subject better, elicit comments from more knowledgeable people, and elucidate the nuances of generics in Go's context. The following proposal makes use of a syntax for ease of presentation. The syntax is, however, not intended to be the primary subject, nor is it intended to represent the complexities or possible complications of the underlying mechanisms.
</p>
<h2>Background</h2>
<p>
In Go, there are fundamental differences between built-in types and user-defined types. When we try to come up with mechanisms that help treat both of them on a completely equal footing, it appears to be stretching the language along difficult dimensions. Particularly troublesome are issues such as the following.
</p>
<ul>
<li>
Built-in types (like int and float64) do not have methods; they only have a pre-defined set of operators. Users <i>cannot</i> redefine those operators, nor can they define new operators.
</li>
<li>
Operator overloading is not allowed, preventing user-defined types from exhibiting operator compatibility with the built-in types.
</li>
<li>
Operators are <i>not</i> methods themselves, and hence do not allow a symmetry between the built-in operations and the user-defined.
</li>
</ul>
<p>
Rather than explore possible solutions to the above, the current proposal accepts them as facts that will not change. In other words, the inherent differences between built-in types and user-defined types are preserved, and no unification is attempted. By adding one new concept (that of <label class="def">type classes</label>, see hereunder) to the language, backwards-compatibility can be maintained even as generics are introduced into the language.
</p>
<h2>Storage Classes</h2>
<p>
Types <code>T1</code> and <code>T2</code> belong to the same <label class="def">storage class</label> if elements of <code>T1</code> can either be assigned to those of <code>T2</code>, or converted to <code>T2</code>, and <i>vice versa</i>. Please note that only conversion is considered, not a type assertion.
</p>
<h2>Type Classes</h2>
<p>
Types <code>T1</code> and <code>T2</code> belong to the same <label class="def">type class</label> if the following are true:
</p>
<ol>
<li>
<code>T1</code> and <code>T2</code> belong to the same storage class and
</li>
<li>
every operation defined on elements of <code>T1</code> is also defined on those of <code>T2</code>, and <i>vice versa</i>.
</li>
</ol>
<p>
As per the above definition, for instance, <code>int32</code> and <code>int64</code> belong to the same type class, while <code>int64</code> and <code>complex64</code> do not. Thus, type classes play the equivalent role of interfaces for built-in types. However, type classes do not introduce new concrete types.
</p>
<h2>Predefined Type Classes</h2>
<p>
The Go language defines a set of predefined type classes for numeric types, as enumerated here.
</p>
<ul>
<li>
<dfn>Integer</dfn> This type class encompasses the following built-in types: <code>int</code>, <code>int8</code>, <code>int16</code>, <code>int32</code>, <code>int64</code>, <code>uint</code>, <code>uint8</code>, <code>uint16</code>, <code>uint32</code>, <code>uint64</code> and <code>byte</code>.
</li>
<li>
<dfn>Float</dfn> This type class encompasses the following built-in types: <code>float32</code> and <code>float64</code>.
</li>
<li>
<dfn>Complex</dfn> This type class encompasses the following built-in types: <code>complex64</code> and <code>complex128</code>.
</li>
</ul>
<p>
In addition, the set of all named interface types in a program constitutes a single type class. This class is unnamed, and is not accessible to the users; it is defined solely for the purposes of implementation. It is <i>not</i> possible to define new type classes.
</p>
<h2>Generic Types</h2>
<p>
A type <code>G</code> with a type parameter <code>T</code> is said to be <label class="def">generic</label> in <code>T</code>. It is denoted by <code>G<T></code>. A generic type may be generic in more than one type parameter. Where multiple type parameters need to be specified together, they are comma-separated. Go's generic types are reified, and are amenable to introspection at run time.
</p>
<p>
A type parameter <code>T</code> can represent any <i>valid</i>, <i>assignable</i> Go type, with the exception of <code>interface{}</code>. The semantics of <code>interface{}</code> interfere with the compile-time checks that generics require. Moreover, the choice of <code>interface{}</code> for a type expressly avoids compile-time specification of a concrete type on the part of the user.
</p>
<p>
The semantics of generic types are dependent on the type parameters; in particular, value types and pointer types lead to different semantics. Go's generic types are reified, and additional type information is passed through hidden parameters (<i>i.e.,</i> the <i>dictionary</i> approach). In the most general case, for efficiency reasons, a distinct reified copy may be generated for each instantiated type class. However, this is an implementation issue.
</p>
<p class="notes">
The dictionary approach incurs some run time cost, but usually does not need the compiler to be available at run time. In my opinion, it is desirable, since that preserves the current Go model.
</p>
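As a rough analogue, the dictionary approach can be mimicked in present-day Go by passing the required operations alongside the values; a toy sketch (the names are mine, and this is not the proposal's mechanism):

```go
package main

import "fmt"

// cmpDict is a hand-written "dictionary": the set of operations the
// generic code needs on its (erased) element type.
type cmpDict struct {
	less func(a, b interface{}) bool
}

// maxOf has one shared body for every element type; the hidden type
// information the proposal would pass implicitly is here passed
// explicitly as the dictionary d.
func maxOf(xs []interface{}, d cmpDict) interface{} {
	m := xs[0]
	for _, x := range xs[1:] {
		if d.less(m, x) {
			m = x
		}
	}
	return m
}

func main() {
	intLess := cmpDict{less: func(a, b interface{}) bool { return a.(int) < b.(int) }}
	fmt.Println(maxOf([]interface{}{3, 7, 5}, intLess)) // prints 7
}
```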
<p>
A type parameter <code>T</code> can appear in the following positions (or sites):
</p>
<ol>
<li>
type parameter declaration,
</li>
<li>
variable type specification,
</li>
<li>
function/method parameter type specification and
</li>
<li>
allocation.
</li>
</ol>
<h3>Type Parameter Declaration</h3>
<p>
Type parameter declaration makes a new type parameter available for subsequent use within the applicable scope. Such a declaration does <i>not</i> introduce a new concrete type. It is illegal to declare a type parameter that is unused within that scope; similarly, it is illegal to use a type parameter that is not declared in that scope or any of its enclosing scopes. Type parameters are <i>not</i> shadowed. Hence, it is illegal to re-declare a type parameter that is already in scope.
</p>
<p>
A generic type <code>G<T></code> with a specification for <code>T</code> introduces a new concrete type. This new concrete type behaves in full compliance with the rules governing normal Go types, including those for type inference, with explicit exceptions where noted.
</p>
<p>
Given a generic type <code>G</code> that is dependent on type parameters <code>T<sub>1</sub>, ..., T<sub>m</sub></code>, it is possible to construct a generic type <code>H</code> that is the same as <code>G</code>, but with <code>n</code> (<code>n < m</code>) of its type parameters instantiated. This leaves <code>H</code> generic in the remaining <code>m-n</code> parameters.
</p>
<pre>
type HashTable<K, V> struct {
    ...
}

type HashSet<K> HashTable<K, bool>
</pre>
<h3>Variable Type Specification</h3>
<p>
Using a generic type as the type of a variable requires specification of concrete types for the generic type's type parameters. The number of type arguments specified must match the number of type parameters in the generic type's declaration. It is illegal to leave any type parameter uninstantiated.
</p>
<pre>
var h1 HashTable<string, float64> <label class="comment">// Valid.</label>
var h2 HashTable<string, T> <label class="comment">// Invalid; T is not instantiated.</label>
</pre>
<p>
It is allowed to specify a type parameter as the type of a variable, as long as the type parameter has a valid declaration in the current scope.
</p>
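For instance, inside a method on <code>HashTable<K, V></code>, the receiver's type parameters may serve as the types of local variables (a hypothetical sketch in the proposed syntax):

```
func (h *HashTable<K, V>) firstKey() K {
    var k K    // K is in scope, by virtue of the receiver's declaration.
    ...
    return k
}
```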
<h3>Function/Method Parameter Type Specification</h3>
<p>
A function or a method can utilize type parameters available in scope. The function/method will then be generic in those type parameters utilized.
</p>
<pre>
func (h *HashTable<K, V>) At(k K) V {
    ...
}
</pre>
<p>
A function/method may itself be generic in one or more type parameters of its own. This is independent of its being generic in type parameters from an outer scope.
</p>
<pre>
func modulus<T>(v T) float64 {
    ...
}
</pre>
<h3>Allocation</h3>
<p>
In the presence of type parameters, allocation requirements may be determined at run time. The type class of the type parameter determines the allocation mechanism. Also, value types and pointer types have different semantics, as with non-generic types.
</p>
<pre>
type S struct {
    ...
}
var v = new(HashTable<string, S>)
</pre>
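Present-day Go can already express run-time-driven allocation through the <code>reflect</code> package; the following sketch is an analogy for the mechanism, not the proposal's implementation:

```go
package main

import (
	"fmt"
	"reflect"
)

type S struct{ X int }

// allocFor allocates a fresh, zero-valued *T for the dynamic type T of
// sample, much as a reified generic could allocate from a hidden
// run-time type argument.
func allocFor(sample interface{}) interface{} {
	return reflect.New(reflect.TypeOf(sample)).Interface()
}

func main() {
	p := allocFor(S{}).(*S)
	fmt.Println(p.X) // prints 0
}
```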
<h3>An Example</h3>
<p>
The following example illustrates the above.
</p>
<pre>
type Vector<T> []T <label class="comment">// T appears in the position of slice type specification.</label>

func (v *Vector<T>) Push(el T) { <label class="comment">// T appears in the position of method parameter type specification.</label>
    <label class="comment">// Note that T needs to be specified again in Vector<T> for it to be in scope.</label>
    *v = append(*v, el)
}

func (v *Vector<T>) Map<U>(mapper func (T) U) *Vector<U> { <label class="comment">// U is a method-specific type parameter.</label>
    u := new(Vector<U>) <label class="comment">// Allocation depends on U.</label>
    for _, el := range *v {
        u.Push(mapper(el))
    }
    return u
}
</pre>
<h2>Constrained Generic Types</h2>
<p>
Since an unconstrained generic type cannot assume anything specific about the nature of the types eventually instantiated into its type parameters, the set of possible operations on the elements of such types is very small. Constraints can improve the situation by allowing a wider set of operations. A type parameter can optionally be constrained to only those types that satisfy a particular interface. This enables all the methods from the interface to be utilized in the context of variables belonging to the applicable type parameter.
</p>
<pre>
type Comparable<E> interface {
    Compare(E) int
}

type OrderedSet<E : Comparable<E>> struct {
    ...
}
</pre>
<p>
The above declaration allows ordered sets to be constructed over any type whose elements are ordered according to the <code>Compare()</code> relation.
</p>
<p>
Similarly, a type parameter can optionally be constrained to only those types that belong to a particular type class.
</p>
<pre>
type Vector<T : Integer> []T <label class="comment">// A vector over all integer types.</label>
</pre>
<p>
The above declaration allows a generic vector type to be written with algorithms that provide operations for any integral type.
</p>
<p>
Constraints on type parameters can appear at all sites except allocation sites. It is illegal to specify more than one constraint for a given type parameter.
</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-34103525510639341102011-08-06T08:16:00.001+05:302012-03-17T05:14:22.667+05:30Linux distribution trials end, hopefully!<div>
<p>I have used several Linux distributions over the last 16-17 years. For many years, my primary Linux distribution was SuSE/openSUSE. Since 2007, it has mostly been Ubuntu. Starting with the release 10.10, though, Ubuntu's (in)stability had become increasingly concerning. Release 11.04 only made matters worse. Window borders were erratic in their display, transitions between workspaces were jarring, the feel was sluggish, and an update even rendered the system unable to restart!</p>
<p>Initially, I thought that Ubuntu's Unity was playing tricks. It perhaps interfered with standard GNOME in such ways as to complicate GDM, power management, etc. However, I could not invest the time needed to investigate the problems. Then, I tried Linux Mint 11. It sure did not have Ubuntu's Unity, but had (almost) all the other problems that Ubuntu 11.04 had.</p>
<p>For the first time, I then tried Arch Linux. I was impressed! It is a small, fast, no-nonsense distribution, and has excellent documentation — the best that I have seen for any Linux distribution at all. Added to that, it is a rolling distribution! We, of course, have to allow that it is meant for people who are rather technically advanced. My team thought it was too much of a hassle to even set up a `working' development machine.</p>
<p>Two weeks ago, I tried CentOS 6. My previous trial of CentOS was with version 5.4. Its power management was so terrible that my laptop used to drain out in about 1.5 hours. It used to last over 3 hours otherwise.</p>
<p>This time, though, it is Just Good! I installed the so-called `Minimal Desktop' flavor. In these two weeks of exclusive use, it has demonstrated itself to be very stable, fast and predictable. I am very impressed! All going well, I will be switching my team to CentOS 6, as well, in the coming week or so!</p>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0tag:blogger.com,1999:blog-4892209359042715256.post-45422744100717297352011-07-28T14:18:00.002+05:302012-03-17T05:16:06.794+05:30Performance improvement - 3<div>
<p>Hmm &hellip; a new addition to the ring detection algorithm set brings the distance matrix back into the game. Over the last week, I was interacting with potential users, trying to assess the accuracy of the algorithm. They wanted an <b>SSSR (Smallest Set of Smallest Rings)</b> result set for each input molecule, rather than the more exhaustive set that I had deemed necessary.</p>
<p>Accordingly, I implemented a new SSSR algorithm. It relies on eliminating contours that can be short-circuited. That, in turn, requires the pair-wise shortest paths between all atoms to be determined first. So, the matrix is back!</p>
<p>Consequently, the running time for my test set of 19946 molecules is now in the range of 16.5-17<i>s</i>. Preliminary profiling shows that building this matrix for each molecule takes almost 60% of the running time. I should investigate better spanning tree algorithms!</p>
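For reference, an all-pairs shortest-path matrix of this kind can be computed with the classic Floyd-Warshall algorithm; a minimal sketch in Go (illustrative only, not necessarily the approach used in the product):

```go
package main

import "fmt"

const inf = 1 << 30 // "no edge" sentinel; large, but safe from overflow

// floydWarshall computes the all-pairs shortest-path distance matrix
// for a graph given as an adjacency matrix, where adj[i][j] is the
// edge weight (or inf when no bond exists).  O(n^3) time, O(n^2) space.
func floydWarshall(adj [][]int) [][]int {
	n := len(adj)
	d := make([][]int, n)
	for i := range d {
		d[i] = make([]int, n)
		copy(d[i], adj[i])
		d[i][i] = 0
	}
	for k := 0; k < n; k++ {
		for i := 0; i < n; i++ {
			for j := 0; j < n; j++ {
				if d[i][k]+d[k][j] < d[i][j] {
					d[i][j] = d[i][k] + d[k][j]
				}
			}
		}
	}
	return d
}

func main() {
	// A 4-atom chain 0-1-2-3 with unit bond lengths.
	adj := [][]int{
		{0, 1, inf, inf},
		{1, 0, 1, inf},
		{inf, 1, 0, 1},
		{inf, inf, 1, 0},
	}
	d := floydWarshall(adj)
	fmt.Println(d[0][3]) // prints 3
}
```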
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com1tag:blogger.com,1999:blog-4892209359042715256.post-73295346686003246022011-07-12T15:41:00.000+05:302013-10-16T22:28:05.852+05:30LOWESS normalization<div>
Several years ago, I led a team that developed a bioinformatics product, specifically for managing and analyzing data from microarray experiments. I wrote several of the algorithms needed (including those for hierarchical clustering and K-means clustering), together with parts of the underlying mathematical machinery (such as matrices and operations on them).<br />
<br />
Even when I did not write an algorithm myself, I often used to write a version (<i>sans</i> error-checking) in Ruby. It was used either as a basis for a production version in C++, or to compare the output from an independently-developed C++ version. Either way, those Ruby versions were useful.<br />
<br />
One such that I wrote was for performing locally-weighted robust regression, usually known as <b>LOWESS</b> normalization. It is actually a rather small algorithm. However, my team felt lazy. A few days later, when, as was the usual practice, I asked my team to produce the C++ version for review, I was told that the Ruby version itself was being incorporated directly into the release branch. I was alarmed, and for two reasons:<br />
<ul><li>the Ruby version was itself untested, and</li>
<li>we were dealing with huge data sets.</li>
</ul><br />
Ruby is notorious for being slow. I tried convincing the team that we could not afford to have one algorithm take a disproportionately long time in comparison with all the other algorithms in the product. However, we were running out of time, and the only other person working on the algorithms had quit by then. Perforce, I had to spend the little time available on testing the Ruby version rather than rewriting it in C++. The reference software used was R.<br />
<br />
Much later, I decided to extract that algorithm, make it independent, and publish it under a liberal license. It was done with the consent of my company, and the result was published on <a href="http://rubyforge.org/projects/microarray/">Rubyforge</a>. At that time, I announced its availability on the ruby-talk mailing list, but on no bioinformatics list <i>per se</i>. Despite that, when I looked at the project statistics today – after a few years – I was pleasantly surprised to see that it has been downloaded over 320 times!<br />
<br />
Here is the README file therein.<br />
<br />
* * * *<br />
<br />
<h3>BACKGROUND</h3><br />
Microarray data normalization is an important pre-processing step for obtaining data that is reliable and usable for downstream analysis.<br />
<br />
LOWESS normalization, or robust locally weighted regression, is one of the most commonly utilized within-slide normalization techniques. Within-slide normalization aims to correct dye-incorporation differences that affect all genes, or all genes of comparable intensity, similarly. Therefore, LOWESS normalization is a useful technique when we suspect a systematic dye-dependent bias.<br />
<br />
<h3>LOWESS</h3><br />
Control-, housekeeping-, or dye-swap-based experiments assume the existence and availability of certain genes whose expression levels can be treated as constant across a wide range. In comparison, LOWESS is much more resilient to range, as well as to the type of study.<br />
<br />
The LOWESS algorithm, as applied to dual-channel microarrays, utilizes a locally weighted polynomial regression of the A-M scatterplot in order to obtain the calibration factor to be applied, where:<br />
<pre> A = log<sub>2</sub>(√(i<sub>R</sub> * i<sub>G</sub>)),
M = log<sub>2</sub>(i<sub>R</sub> / i<sub>G</sub>),
i<sub>R</sub> = intensity in the red channel, and
i<sub>G</sub> = intensity in the green channel.
</pre>for each spot on the microarray.<br />
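In code, the per-spot transformation is a pair of one-liners; a small sketch (Go is used here purely for illustration — the original script is Ruby):

```go
package main

import (
	"fmt"
	"math"
)

// amPoint maps a spot's red/green channel intensities to the (A, M) plane:
//   A = log2(sqrt(iR * iG))  -- average log-intensity
//   M = log2(iR / iG)        -- log-ratio, where dye bias shows up
func amPoint(iR, iG float64) (a, m float64) {
	a = math.Log2(math.Sqrt(iR * iG))
	m = math.Log2(iR / iG)
	return
}

func main() {
	a, m := amPoint(1024, 256)
	fmt.Println(a, m) // prints 9 2
}
```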
<br />
A general LOWESS algorithm is parameterized by <b>five</b> variables:<br />
<ol><li>order of the polynomial,</li>
<li>number of regression iterations,</li>
<li>weight function,</li>
<li>subset of the input space considered for regression, and</li>
<li>number of points to regress at.</li>
</ol><br />
Of the above, I have restricted the order of the polynomial to 1, and the number of regression iterations to 1 as well (see references for why these are sufficient in a vast majority of cases). These parameters are normally referred to as <b><i>degree of polynomial</i></b> and <b><i>iterations</i></b>, respectively.<br />
<br />
The weight function employed is that which is most widely used: the <b><i>tri-cube function</i></b>, defined as<br />
<pre> weight = (1 - (d/m)<sup>3</sup>)<sup>3</sup>,
</pre>where,<br />
<pre> d = distance between the point of regression and the current point, and
m = maximum distance in the interval.
</pre><br />
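The weight function translates directly into code; a minimal sketch (again in Go, for illustration):

```go
package main

import "fmt"

// triCube is the tri-cube weight function: (1 - (d/m)^3)^3, where d is
// the distance from the point of regression and m the maximum distance
// in the interval.  The weight is 1 at the regression point itself and
// falls smoothly to 0 at the edge of the interval.
func triCube(d, m float64) float64 {
	r := d / m
	w := 1 - r*r*r
	return w * w * w
}

func main() {
	fmt.Println(triCube(0, 2), triCube(1, 2), triCube(2, 2)) // prints 1 0.669921875 0
}
```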
The operating subset of the input space, for regression, is determined using a fraction in the half-open interval (0, 1]. This fraction is usually called the <b><i>smoothening parameter</i></b>.<br />
<br />
The input space is divided into a number of cells, and a representative point is picked up from each cell for regression. The number of elements in such cells is determined by a fraction in the half-open interval (0, 0.1]. This fraction is usually called <b><i>delta</i></b>.<br />
<br />
As can be readily seen, too high a value for the smoothening parameter tends to over-smoothen the plot, while too low a value is very sensitive to outliers. Too high a value for delta leads to fewer points at which regression is performed, resulting in a coarse approximation, while too low a value yields higher accuracy, but at the expense of more computing power.<br />
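To make the two fractions concrete, here is a hypothetical helper (the name and shape are mine, not the script's) showing the counts they induce for a microarray of n spots:

```go
package main

import (
	"fmt"
	"math"
)

// regressionCounts shows what the two fractions translate to for n
// spots: the smoothening parameter f selects roughly f*n nearest
// points for each local fit, while delta groups roughly delta*n points
// into one cell that shares a single regressed value.
func regressionCounts(n int, f, delta float64) (neighbours, cellSize int) {
	neighbours = int(math.Round(f * float64(n)))
	cellSize = int(math.Round(delta * float64(n)))
	if cellSize < 1 {
		cellSize = 1
	}
	return
}

func main() {
	nb, cs := regressionCounts(10000, 0.3, 0.01)
	fmt.Println(nb, cs) // prints 3000 100
}
```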
<br />
<h3>MODES SUPPORTED</h3><br />
Normalization can be performed either on all the elements of the microarray at once, or by pin/zone/subarray.<br />
<br />
<h3>HOW TO RUN</h3><br />
This script requires Jeff Mitchell's 'linalg' for matrix manipulations. It is available from the Ruby Application Archive. Please install it before running this script.<br />
<br />
The input data must be in a tab-separated text file, and must not contain any header or trailer lines.<br />
<br />
The file 'main.rb' is the driver that needs to be run. Run it without any arguments, and a short usage note is printed.<br />
<br />
<h3>REFERENCES</h3><br />
<ul><li>Cleveland, W.S., "Robust locally weighted regression and smoothing scatterplots", J. Amer. Stat. Assoc. 74, 829-836 (1979).</li>
<li>Yang, I.V. et al., "Within the fold: assessing differential expression measures and reproducibility in microarray assays", Genome Biol. 3, research0062.1-0062.12 (2002).</li>
<li>Quackenbush, J., "Microarray data normalization and transformation", Nature Genetics 32 (Suppl.), 496-501 (2002).</li>
<li>Berger, J.A. et al., "Optimized LOWESS normalization parameter selection for DNA microarray data", BMC Bioinformatics 5, 194 (2004).</li>
</ul>
</div>Anonymoushttp://www.blogger.com/profile/15731991685864379412noreply@blogger.com0