2011-01-26

Equivalence classes -- a recurring pattern

Within the last couple of months, I have found myself applying one pattern in three different scenarios.

(1) Literature on Clinical Trials

Each clinical trial - usually - has multiple trial aliases. A journal article may cite one or more of them. It is, therefore, possible for two articles to refer to the same trial, but using two different aliases. However, as we compile newer articles, it may be possible to cross-link them based on those that cite multiple aliases.

Example:
- Article A1 refers to trial aliases TA1 and TA5.
- Article A2 refers to trial aliases TA2 and TA4.
- Article A3 refers to trial aliases TA1 and TA3.
- Article A4 refers to trial aliases TA4 and TA5.
- Article A5 refers to trial alias TA3.

Based on the above, the program should infer that they are all referring to the same trial.

(2) Stripping Salts of a Molecule

possible complex may have more than one molecule. Often, it is a combination of a main molecule and one or more salts. Reading from an MDL `.mol' file, the program should recognize the different units, identify the main molecule, and strip the salts. The input is read an atom at a time, followed by a bond at a time.

Obviously, the two atoms participating in a bond belong to one unit. The neighbors of each of them also belong to the same unit, etc.

(3) Fused Rings of a Molecule

A single molecule can have multiple ring systems. Within each such system may exist a set of fused rings. Step 1 is to identify all rings in the system. In Step 2, we have to identify fused rings.

If rings R1 and R2 are fused, they belong to one ring system. The fused neighbors of each of them also belong to the same ring system, etc.

Equivalence Classes

Each of the problems above has the same recurring pattern -- that of equivalence classes! There is a defined equivalence relation on the input set. The program has to utilize that to partition the set. Recognizing that pattern has helped me in streamlining my design and implementation.