Several years ago, I led a team that developed a bioinformatics product, specifically for managing and analyzing data from microarray experiments. I wrote several of the algorithms needed (including those for hierarchical clustering and K-means clustering), together with parts of the underlying mathematical machinery (such as matrices and operations on them).
Even when I did not write an algorithm myself, I often used to write a verion (
sans error-checking) in Ruby. It was used either as a basis for a production version in C++, or to compare the output from an independently-developed C++ version. Either way, those Ruby versions were useful.
One such that I wrote was for performing locally-weighted robust regression, usually known as
LOWESS normalization. It is actually a rather small algorithm. However, my team felt lazy, and included the Ruby version directly in the release branch. A few days later, as was the usual practice, I asked my team to produce the C++ version for review. I was told that the Ruby version itself was being directly incorporated into the release branch. I was alarmed, and for two reasons:
- the Ruby version was itself untested, and
- we were dealing with huge data sets.
Ruby is notorious for being slow. I tried convincing the team that we could not afford one algorithm to take disproportionately long time in comparison to all the other algorithms in the product. However, we were running out of time, and the only other person working on the algorithms had quit by then. Perforce, I had to spend the little time available on testing the Ruby version rather than rewriting it in C++. The reference software used was R.
Much later, I decided to extract that algorithm, make it independent, and publish it under a liberal license. It was done with the consent of my company, and the result was published on
Rubyforge. At that time, I announced its availability on ruby-talk mailing list, but on no bioinformatics list
per se. Despite that, when I looked at the project statistics today – after a few years – I am pleasantly surprised to see that it has been downloaded over 320 times!
Here is the README file therein.
* * * *
BACKGROUND
Microarray data normalization is an important pre-processing step for obtaining data that is reliable and usable for downstream analysis.
LOWESS normalization, or robust locally weighted regression, is one of the most commonly utilized within-slide normalization techniques. Within-slide normalization aims to correct dye incorporation differences which affect all the genes similarly, or genes with comparable intensity, similarly. Therefore, LOWESS normalization is a useful technique when we suspect a systematic dye-dependent bias.
LOWESS
Control-, house keeping-, or dye swap-based experiments assume the existence and availability of certain genes whose expression levels can be treated to be constant across a wide range. In comparison, LOWESS is much more resilient to range, as well as the type of study.
The LOWESS algorithm, as applied to dual-channel microarrays, utilizes a locally weighted polynomial regression of the A-M scatterplot in order to obtain the calibration factor to be applied, where:
A = log2(√(iR * iG)),
M = log2(iR / iG),
iR = intensity in the red channel, and
iG = intensity in the green channel.
for each spot on the microarray.
A general LOWESS algorithm is parameterized by
five variables:
- order of the polynomial,
- number of regression iterations,
- weight function,
- subset of the input space considered for regression, and
- number of points to regress at.
Of the above, I have restricted the order of the polynomial to 1, and the number of regression iterations, also, to 1 (see references for why these are sufficient in a vast majority of cases). These parameters are normally referred to as
degree of polynomial and
iterations, respectively.
The weight function employed is that which is most widely used: the
tri-cube function, defined as
weight = (1 - (d/m)3)3,
where,
d = distance between the point of regression and the current point, and
m = maximum distance in the interval.
The operating subset of the input space, for regression, is determined using a fraction in the half-open interval (0, 1]. This fraction is usually called the
smoothening parameter.
The input space is divided into a number of cells, and a representative point is picked up from each cell for regression. The number of elements in such cells is determined by a fraction in the half-open interval (0, 0.1]. This fraction is usually called
delta.
As can be readily seen, too high values for smoothening parameter tend to over-smoothen the plot, while too low values are very sensitive to outliers. Too high values for delta lead to fewer points at which regression is performed, resulting in a coarse approximation, while too low values would yield higher accuracy, but at the expense of more computing power.
MODES SUPPORTED
Normalization can be performed on either all the elements of the microarray, or by pin/zone/subarray.
HOW TO RUN
This sciprt requires Jeff Mitchell's 'linalg' for matrix manipulations. It is available from the Ruby Application Archive. Please install that before running this script.
The input data must be in a tab-separated text file, and must not contain any header or trailer lines.
The file 'main.rb' is the driver that needs to be run. Run it without any arguments, and a small usage notes is printed.
REFERENCES
- Cleveland, W.S., "Robust locally weighted regression and smoothing scatterplots", J. Amer. Stat. Assoc. 74, 829-836 (1979).
- Yang, I.V. et al., "Within the fold: assessing differential expression measures and reproducibility in microarray assays", Genome Biol. 3, research0062.1-0062.12 (2002).
- Quackenbush, J., "Microarray data normalization and transformation", Nature Genetics. Vol.32 supplement pp496-501 (2002).
- Berger, J.A. et al., "Optimized LOWESS normalization parameter selection for DNA microarray data", BMC Bioinformatics 2004, 5:194.