2011-07-28

Performance improvement - 3

Hmm … a new addition to the ring detection algorithm set brings the distance matrix back into the game. Over the last week, I was interacting with potential users, trying to assess the accuracy of the algorithm. They wanted an SSSR (Smallest Set of Smallest Rings) result set for each input molecule, rather than the more exhaustive set that I had deemed necessary.

Accordingly, I implemented a new SSSR algorithm. It relies on eliminating contours that can be short-circuited. That, in turn, requires the shortest paths between all pairs of atoms to be determined first. So, the matrix is back!
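
To illustrate the idea, here is a rough Go sketch of the short-circuit test, assuming a pre-computed pair-wise distance matrix (the identifiers are illustrative, not from the actual code):

    // isShortCircuited reports whether some pair of atoms in the
    // candidate ring is closer through the rest of the graph than
    // along the ring itself.  `ring` lists atom indices in cyclic
    // order; `dist` is the pair-wise shortest-distance matrix.
    func isShortCircuited(ring []int, dist [][]int) bool {
        n := len(ring)
        for i := 0; i < n; i++ {
            for j := i + 1; j < n; j++ {
                alongRing := minInt(j-i, n-(j-i)) // distance along the ring
                if dist[ring[i]][ring[j]] < alongRing {
                    return true // a shorter path short-circuits this contour
                }
            }
        }
        return false
    }

    func minInt(a, b int) int {
        if a < b {
            return a
        }
        return b
    }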

Consequently, the running time for my test set of 19946 molecules is now in the range of 16.5-17s. Preliminary profiling shows that building this matrix for each molecule takes almost 60% of the running time. I should investigate better spanning tree algorithms!

2011-07-12

LOWESS normalization

Several years ago, I led a team that developed a bioinformatics product, specifically for managing and analyzing data from microarray experiments. I wrote several of the algorithms needed (including those for hierarchical clustering and K-means clustering), together with parts of the underlying mathematical machinery (such as matrices and operations on them).

Even when I did not write an algorithm myself, I often wrote a version (sans error-checking) in Ruby. It served either as the basis for a production version in C++, or as a check on the output of an independently-developed C++ version. Either way, those Ruby versions were useful.

One such version was for performing locally-weighted robust regression, usually known as LOWESS normalization. It is actually a rather small algorithm. However, my team felt lazy: when, a few days later, I asked for the C++ version for review, as was the usual practice, I was told that the Ruby version itself was being incorporated directly into the release branch. I was alarmed, for two reasons:
  • the Ruby version was itself untested, and
  • we were dealing with huge data sets.

Ruby is notorious for being slow. I tried convincing the team that we could not afford to have one algorithm take a disproportionately long time in comparison with all the other algorithms in the product. However, we were running out of time, and the only other person working on the algorithms had quit by then. Perforce, I had to spend the little time available on testing the Ruby version rather than rewriting it in C++. The reference software used was R.

Much later, I decided to extract that algorithm, make it independent, and publish it under a liberal license. This was done with the consent of my company, and the result was published on Rubyforge. At that time, I announced its availability on the ruby-talk mailing list, but not on any bioinformatics list per se. Despite that, when I looked at the project statistics today, after a few years, I was pleasantly surprised to see that it has been downloaded over 320 times!

Here is the README file therein.

* * * *

BACKGROUND


Microarray data normalization is an important pre-processing step for obtaining data that is reliable and usable for downstream analysis.

LOWESS normalization, or robust locally weighted regression, is one of the most commonly utilized within-slide normalization techniques. Within-slide normalization aims to correct dye-incorporation differences that affect all genes similarly, or that affect genes of comparable intensity similarly. LOWESS normalization is, therefore, a useful technique when we suspect a systematic dye-dependent bias.

LOWESS


Control-, housekeeping-, or dye-swap-based experiments assume the existence and availability of certain genes whose expression levels can be treated as constant across a wide range. In comparison, LOWESS is much more resilient to range, as well as to the type of study.

The LOWESS algorithm, as applied to dual-channel microarrays, utilizes a locally weighted polynomial regression of the A-M scatterplot to obtain the calibration factor to be applied to each spot on the microarray, where:
    A    = log2(√(iR * iG)),
    M    = log2(iR / iG),
    iR   = intensity in the red channel, and
    iG   = intensity in the green channel.
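
For concreteness, here is a minimal Go sketch of that transformation (the published script itself is in Ruby; the function name is mine):

    import "math"

    // amValues converts the raw red and green intensities of a spot
    // into its (A, M) coordinates on the scatterplot.
    func amValues(iR, iG float64) (a, m float64) {
        a = math.Log2(math.Sqrt(iR * iG)) // average log-intensity
        m = math.Log2(iR / iG)            // log-ratio
        return a, m
    }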

A general LOWESS algorithm is parameterized by five variables:
  1. order of the polynomial,
  2. number of regression iterations,
  3. weight function,
  4. subset of the input space considered for regression, and
  5. number of points to regress at.

Of the above, I have restricted both the order of the polynomial and the number of regression iterations to 1 (see the references for why these suffice in the vast majority of cases). These parameters are normally referred to as the degree of the polynomial and iterations, respectively.

The weight function employed is that which is most widely used: the tri-cube function, defined as
    weight = (1 - (d/m)^3)^3,
where,
    d = distance between the point of regression and the current point, and
    m = maximum distance in the interval.
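
In code, the weight function is tiny; a Go sketch (for illustration only; d and m are non-negative, with d <= m inside the window):

    // triCube computes the tri-cube weight of a point at distance d
    // from the point of regression, m being the maximum distance in
    // the interval.
    func triCube(d, m float64) float64 {
        r := d / m
        w := 1 - r*r*r
        return w * w * w
    }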

The subset of the input space operated on for regression is determined using a fraction in the half-open interval (0, 1]. This fraction is usually called the smoothing parameter.

The input space is divided into a number of cells, and a representative point is picked from each cell for regression. The number of elements in such a cell is determined by a fraction in the half-open interval (0, 0.1]. This fraction is usually called delta.

As can be readily seen, too high a value for the smoothing parameter tends to over-smooth the plot, while too low a value makes the fit very sensitive to outliers. Too high a value for delta leads to fewer points at which regression is performed, resulting in a coarse approximation, while too low a value yields higher accuracy, but at the expense of more computing power.
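
The following Go sketch shows one plausible way these two fractions can drive the computation, assuming input sorted by A; the helper names are mine, not those of the script:

    import "math"

    // regressionWindow returns the bounds of the q nearest points to
    // xs[i], where q is derived from the smoothing parameter f.  xs
    // must be sorted in ascending order.
    func regressionWindow(xs []float64, i int, f float64) (lo, hi int) {
        n := len(xs)
        q := int(math.Ceil(f * float64(n)))
        lo, hi = i, i
        for hi-lo+1 < q {
            switch {
            case lo == 0:
                hi++
            case hi == n-1:
                lo--
            case xs[i]-xs[lo-1] <= xs[hi+1]-xs[i]:
                lo-- // the nearer unused point lies to the left
            default:
                hi++
            }
        }
        return lo, hi
    }

    // cellRepresentatives picks one representative index per cell of
    // delta*n elements; regression is performed only at these points,
    // and the fit is interpolated in between.
    func cellRepresentatives(n int, delta float64) []int {
        step := int(math.Max(1, math.Floor(delta*float64(n))))
        reps := make([]int, 0, n/step+1)
        for i := 0; i < n; i += step {
            reps = append(reps, i)
        }
        return reps
    }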

MODES SUPPORTED


Normalization can be performed on either all the elements of the microarray, or by pin/zone/subarray.

HOW TO RUN


This script requires Jeff Mitchell's 'linalg' for matrix manipulations. It is available from the Ruby Application Archive. Please install it before running this script.

The input data must be in a tab-separated text file, and must not contain any header or trailer lines.

The file 'main.rb' is the driver that needs to be run. Run it without any arguments, and brief usage notes are printed.

REFERENCES


  • Cleveland, W.S., "Robust locally weighted regression and smoothing scatterplots", J. Amer. Stat. Assoc. 74, 829-836 (1979).
  • Yang, I.V. et al., "Within the fold: assessing differential expression measures and reproducibility in microarray assays", Genome Biol. 3, research0062.1-0062.12 (2002).
  • Quackenbush, J., "Microarray data normalization and transformation", Nature Genetics 32 (supplement), 496-501 (2002).
  • Berger, J.A. et al., "Optimized LOWESS normalization parameter selection for DNA microarray data", BMC Bioinformatics 5, 194 (2004).

2011-07-01

Performance improvement - 2

No sooner had I finished my previous post than I began reviewing the algorithmic aspects of ring discovery in the program. Two points stood out.

  1. During the analysis of the molecule's graph, we build a matrix of pair-wise distances between all atoms of the molecule.
  2. During ring discovery, we remove all terminal chains (remember terminal nodes from here?), and then explore the remaining graph to detect rings.

As I reviewed the code, I saw that I was recursively walking the graph in both cases. When building the distance matrix, the logic is simple: directly bonded neighbors are at a distance of 1; recurse over the neighbors of each neighbor, and increment the distance by 1 at each level. Whenever we encounter a shorter path between the same pair, we record this shorter distance. Exercise: Why do multiple paths exist at all? A similar algorithm is followed for ring discovery, but walking only the subgraph comprising non-terminal atoms.
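
In outline, that recursive walk looks like the following Go sketch (the identifiers are illustrative; the actual code carries more state around):

    // buildDistances computes the pair-wise distance matrix of a
    // molecule.  adj[i] lists the atoms directly bonded to atom i.
    func buildDistances(adj [][]int) [][]int {
        n := len(adj)
        dist := make([][]int, n)
        for i := range dist {
            dist[i] = make([]int, n)
            for j := range dist[i] {
                dist[i][j] = n + 1 // longer than any possible path
            }
        }
        for start := range adj {
            dist[start][start] = 0
            walk(adj, dist, start, start, 0)
        }
        return dist
    }

    // walk recurses over the neighbors of `cur`, recording a distance
    // whenever it improves on the best one known so far.
    func walk(adj [][]int, dist [][]int, start, cur, d int) {
        for _, nbr := range adj[cur] {
            if d+1 < dist[start][nbr] { // a shorter path: record it
                dist[start][nbr] = d + 1
                walk(adj, dist, start, nbr, d+1)
            }
        }
    }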

Whenever a ring is detected, it is necessary to determine whether or not it is a Jordan curve (a simple closed path). We are interested in Jordan curves; we mark the others as fused systems. How, then, do we determine if the current path is a Jordan curve? The parent molecule of every component has a map that holds information about bonds. The key of the map is a hash; the value, the bond itself. Suppose that the bond is between the atoms a1 and a2. The hash is computed as 10000 * aL.id + aG.id, where aL is the atom with the lower ID of the two, and aG is that with the higher. This map is constructed when parsing the input, and is, consequently, ready by the time ring discovery is requested. Clearly, it is adequate for determining whether a given path is a Jordan curve. How? :-)
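
One plausible answer, as a Go sketch (the types here are assumed; in the actual code, the map's values are the bonds themselves): a path is a Jordan curve exactly when every pair of atoms consecutive in it is bonded, and no other pair is.

    // bondKey reproduces the hash described above for a bond between
    // the atoms with the given IDs.
    func bondKey(id1, id2 int) int {
        if id1 > id2 {
            id1, id2 = id2, id1
        }
        return 10000*id1 + id2
    }

    // isJordanCurve reports whether the cyclic path is a simple closed
    // path: consecutive atoms must be bonded, and no bond may exist
    // between non-consecutive atoms of the path.
    func isJordanCurve(path []int, bonds map[int]bool) bool {
        n := len(path)
        for i := 0; i < n; i++ {
            for j := i + 1; j < n; j++ {
                consecutive := j == i+1 || (i == 0 && j == n-1)
                if bonds[bondKey(path[i], path[j])] != consecutive {
                    return false
                }
            }
        }
        return true
    }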

I suddenly recollected that in the first version of the ring discovery algorithm, the distance matrix was used to test if a path was a Jordan curve. In that version, it was necessary to pre-calculate the distance matrix. Since that is no longer mandatory, I removed the call to build it during ring discovery.

Lo and behold — the running time dropped from 18.5-19s to 6.5-7s! This time, the improvement in performance had nothing to do with Go, and everything to do with my own oversight. Needless to say, I am happier now! :-)