2010-11-22

Large files and multiple cores

As part of my product, I maintain an available chemicals database (ACD) compiled from third-party catalogs. Such catalogs typically come as SD files (SDF), a format specified by MDL. Importing them ran into problems that aborted the run, and locating the troublesome molecule in the vastness of an SD file by hand was getting tedious. After a while, the programmer in me woke up! The result is a small (~750 LoC) utility written in Go, which I blandly called sdf.

~ % sdf help

Usage:

sdf help
Prints this usage notes message, and exits.

sdf show in=file [from=m] [to=n]
Fetches and displays molecules from the file 'file',
starting with the molecule numbered 'm', optionally to the
molecule numbered 'n'. If 'from' is not specified, the
first molecule is used as the starting molecule. If 'to'
is not specified, all molecules until the end of file are
displayed. Specifically, to display only the m'th molecule,
you should specify 'from=m to=m'.

sdf copy in=file1 out=file2 [from=m] [to=n]
Similar to 'show' above, difference being that the output
is written to 'file2'. Any existing 'file2' will be
truncated.

sdf searcha in=file [from=m] [to=n] symbol=count [symbol=count] [mx=c]
Performs a search for the first molecule in the given range
that has the specified number of atoms of each element type.
The number of processor cores to use can be specified using
'mx'; default is 2.

sdf searcht in=file [from=m] [to=n] tag=name tagval=value [tag=name tagval=value] [mx=c]
Performs a search for the first molecule in the given range
that has the specified tags and values. The number of
processor cores to use can be specified using 'mx'; default
is 2.
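
To make the usage above concrete, here is a minimal sketch of the two ideas behind it: selecting a range of molecules from an SD stream, and checking them on several goroutines to find the first match. It is not the actual sdf source. The names readRecords, fromString, hasTag and searchFirst are hypothetical, and the sketch assumes that every record ends with a line containing only "$$$$" and that a tag's value sits on the line right after its "> <NAME>" header.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"sync"
)

// sample holds three tiny, made-up records for illustration only.
const sample = "mol1\n> <ID>\nA1\n\n$$$$\nmol2\n> <ID>\nB2\n\n$$$$\nmol3\n> <ID>\nB2\n\n$$$$\n"

// readRecords returns molecules from..to (1-based, inclusive) from an SD
// stream. A record ends at a "$$$$" line; to <= 0 means "until end of file".
// Trailing text without a closing "$$$$" is ignored.
func readRecords(r io.Reader, from, to int) ([]string, error) {
	var recs []string
	var cur strings.Builder
	n := 1 // number of the molecule currently being read
	sc := bufio.NewScanner(r)
	sc.Buffer(make([]byte, 64*1024), 16*1024*1024) // tolerate long lines
	for sc.Scan() {
		line := sc.Text()
		if n >= from {
			cur.WriteString(line)
			cur.WriteByte('\n')
		}
		if strings.TrimSpace(line) == "$$$$" {
			if n >= from {
				recs = append(recs, cur.String())
				cur.Reset()
			}
			if to > 0 && n >= to {
				break
			}
			n++
		}
	}
	return recs, sc.Err()
}

// fromString is a convenience wrapper for in-memory input.
func fromString(s string, from, to int) ([]string, error) {
	return readRecords(strings.NewReader(s), from, to)
}

// hasTag reports whether rec has a data item "> <name>" whose first value
// line equals value.
func hasTag(rec, name, value string) bool {
	lines := strings.Split(rec, "\n")
	for i, l := range lines {
		if strings.HasPrefix(l, ">") && strings.Contains(l, "<"+name+">") &&
			i+1 < len(lines) && strings.TrimSpace(lines[i+1]) == value {
			return true
		}
	}
	return false
}

// searchFirst tests records on `workers` goroutines and returns the index of
// the lowest-numbered record for which match is true, or -1 if none matches.
func searchFirst(recs []string, workers int, match func(string) bool) int {
	if workers < 1 {
		workers = 1
	}
	jobs := make(chan int)
	var mu sync.Mutex
	var wg sync.WaitGroup
	first := -1
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				if match(recs[i]) {
					mu.Lock()
					if first == -1 || i < first {
						first = i
					}
					mu.Unlock()
				}
			}
		}()
	}
	for i := range recs {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return first
}

func main() {
	recs, err := fromString(sample, 2, 3) // like from=2 to=3
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	i := searchFirst(recs, 2, func(r string) bool { return hasTag(r, "ID", "B2") })
	fmt.Println("records read:", len(recs), "first match at index:", i)
}
```

For simplicity the sketch reads the whole range into memory and does not stop early once a match is found. For very large files, a real tool would more likely stream records to the workers and cancel the remaining work after the first hit.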
