One of Many Worlds: 2010

2010-11-22

Large files and multiple cores

As part of my product, I have an available chemicals database (ACD) that is compiled from third-party catalogs. Such catalogs are typically in SDF file format, as specified by MDL. I ran into some issues when importing them, resulting in aborted runs. It was getting tedious to locate the troublesome molecule in the vastness of the SD files. After a while, the programmer in me woke up! The consequence is a small (~750 LoC) utility written in Go, which I blandly called sdf.

~ % sdf help

Usage:

sdf help
Prints this usage notes message, and exits.

sdf show in=file [from=m] [to=n]
Fetches and displays molecules from the file 'file',
starting with the molecule numbered 'm', optionally to the
molecule numbered 'n'. If 'from' is not specified, the
first molecule is used as the starting molecule. If 'to'
is not specified, all molecules until the end of file are
displayed. Specifically, to display only the m'th molecule,
you should specify 'from=m to=m'.

sdf copy in=file1 out=file2 [from=m] [to=n]
Similar to 'show' above, difference being that the output
is written to 'file2'. Any existing 'file2' will be
truncated.

sdf searcha in=file [from=m] [to=n] symbol=count [symbol=count] [mx=c]
Performs a search for the first molecule in the given range
that has the specified number of atoms of each element type.
The number of processor cores to use can be specified using
'mx'; default is 2.

sdf searcht in=file [from=m] [to=n] tag=name tagval=value [tag=name tagval=value] [mx=c]
Performs a search for the first molecule in the given range
that has the specified tags and values. The number of
processor cores to use can be specified using 'mx'; default
is 2.
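
The record-numbering idea behind 'show' can be sketched in a few lines of Go. This is not the code of sdf itself; it is a minimal illustration that assumes only the SD file convention that every record ends with a line containing just `$$$$'. The names nthRecord and nthRecordInString are made up for this sketch.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// nthRecord returns the text of the n-th (1-based) record read from r.
// A record is every line up to and including its "$$$$" terminator.
// Hypothetical helper for illustration; not taken from the sdf utility.
func nthRecord(r io.Reader, n int) (string, error) {
	if n < 1 {
		return "", errors.New("record number must be 1 or greater")
	}
	sc := bufio.NewScanner(r)
	// Allow lines longer than the scanner's default 64 KB limit.
	sc.Buffer(make([]byte, 0, 64*1024), 1024*1024)

	cur := 1 // number of the record currently being read
	var rec strings.Builder
	for sc.Scan() {
		line := sc.Text()
		if cur == n {
			rec.WriteString(line)
			rec.WriteByte('\n')
		}
		if line == "$$$$" {
			if cur == n {
				return rec.String(), nil
			}
			cur++
		}
	}
	if err := sc.Err(); err != nil {
		return "", err
	}
	return "", fmt.Errorf("input has fewer than %d complete records", n)
}

// nthRecordInString is a convenience wrapper for in-memory data.
func nthRecordInString(sdf string, n int) (string, error) {
	return nthRecord(strings.NewReader(sdf), n)
}

func main() {
	// Two placeholder records; real records carry atom and bond blocks.
	const sample = "mol A\n  data\n$$$$\nmol B\n  data\n$$$$\n"
	rec, err := nthRecordInString(sample, 2)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(rec)
}
```

Reading line by line keeps memory use flat no matter how large the catalog is, which matters when the troublesome molecule sits deep inside a multi-gigabyte file.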

2010-11-13

Java and native code

Last week, one of my clients' projects ran into a problem. The team had introduced a new feature in the latest release. It ran fine in testing, but began having mysterious crashes in production. Upon some investigation, they found that a particular third-party native library (`.so' file) was causing the error. This native library was used to compute the InChI value of a given molecule. The application itself is a Java Web application written using the Struts framework.

Initially, there was a path-related problem. They solved it. Then there was a 32-bit vs. 64-bit problem (or so they thought). They solved it. Then there was some version mismatch. They solved it. At that point, they moved it into production. Presently, the mysterious crashes started. Yesterday, I was called in.

Even before I landed there, my thinking was: (a) the molecules that triggered the crashes could be known corner cases of the third-party library, or (b) there is memory corruption (e.g. the JVM trying to garbage-collect memory allocated by the C runtime).

I began by understanding the issue's history. I familiarized myself with the part of the code that was interfacing with the native code. Together with the team, I set up various scenarios to reproduce the crash. After a couple of hours, the first crash did occur. Using trace information, we looked up the forums of the third-party vendor who supplied the native library and its wrapper Java library. The forums had discussions concerning some of exactly the same scenarios that we had set up for testing. They were stated to be known problems. We also saw that a newer version of the vendor library was available. At this point, it appeared as if my first guess was probably correct.

Re-running the scenarios with the updated library caused similar crashes, nevertheless! After a couple more hours, the situation appeared to have reached a dead end.

Then, I suddenly saw a pattern in the crashes -- they all happened only in multi-user scenarios! Then I remembered my second line of reasoning. It was memory corruption, but probably due to global data in the native library! We quickly verified this by simultaneously launching the said functionality in independent JVM instances. There were no crashes!

So, that indeed was the problem! The native library was not thread-safe! We then re-wrote that part of the application to factor the native part into a separate process, run in a new JVM instance. Each client request invoking that particular functionality would run in a separate JVM, handing the results back to the parent program in a text file. With the thread-safety problem out of the way, the application went back to running happily!

2010-11-04

Happy ending!

I was thinking of filing an issue for the stat problem in Go for Windows. I was about to do that this evening, when I received my daily digest of the Go mailing list. I found an announcement therein that a new release of Go was made last night. The release notes said, to my disbelief, that the stat problem in Windows was solved!

I eagerly downloaded the new version for Windows, and re-compiled the SHA-1 summing program with it. I ran it, I admit, with some trepidation. It ran as it should (have) -- smoothly!

Now, that is some coincidence! Happy ending, thusly!

2010-11-03

Fruition :-)

I spent a couple of hours rewriting the Go version of the program in C++ using Qt. This version worked flawlessly, much to my relief!

Qt is a very well-designed library! This is the same feeling that gets reinforced every time I work with it.

I used a class called QDirIterator, whose function is self-describing. While Qt supports STL-style iterators, it has also offered Java-style iterators since Qt 4. These iterators support the hasNext() and next() combination for traversal through containers, and QDirIterator follows the same pattern. I find these much easier to use than the corresponding STL-style iterators.

In all, I enjoyed the program's rewrite exercise, which also came to fruition!

2010-10-28

Fruitless effort :-(

I wrote a small program to compute SHA-1 sums of directory trees, to easily identify and locate files occurring more than once. I wrote it on Linux using the Go language. I tested it on large trees of sizes in the range 1-4 GB. I cross-checked the results - through random sampling - using `sha1sum', and I was happy! I then booted into a Windows 7 VMware image and compiled it. Then began the trouble.

Go is not officially supported on Windows yet. There is an unofficial port that closely follows the official releases. It is clearly marked `experimental'. I should have realized that when an OS is not yet officially supported, the weakest areas would be OS interfaces. The inane dud that I am, I did not think enough!

While walking the directory tree, I was doing a `stat' of each entry: (a) to find whether it is a directory or a file, and (b) if it is a file, to know its size in bytes. In Windows, the `stat' call was failing. After some gymnastics, the program was walking the directory tree. But some directories were getting recognized as files. In addition, I could not make it `stat' the entries whose full path contained spaces. I exhausted all conventional mechanisms, to no avail.
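
For reference, here is a minimal sketch of the same idea written against today's Go standard library, which did not exist in this form in 2010. It is not the original program: the function names sha1File, hashTree and duplicates are made up for illustration, and filepath.WalkDir reports the entry type directly, so no separate `stat' call is needed to tell files from directories.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// sha1File streams one file through SHA-1 and returns the hex digest.
func sha1File(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()
	h := sha1.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashTree walks root and returns a map from file path to SHA-1 digest.
func hashTree(root string) (map[string]string, error) {
	sums := make(map[string]string)
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if !d.Type().IsRegular() {
			return nil // skip directories, symlinks and special files
		}
		sum, err := sha1File(path)
		if err != nil {
			return err
		}
		sums[path] = sum
		return nil
	})
	return sums, err
}

// duplicates groups paths by digest and keeps only the digests that
// occur more than once; each group is sorted for stable output.
func duplicates(sums map[string]string) map[string][]string {
	groups := make(map[string][]string)
	for path, sum := range sums {
		groups[sum] = append(groups[sum], path)
	}
	for sum, paths := range groups {
		if len(paths) < 2 {
			delete(groups, sum)
			continue
		}
		sort.Strings(paths)
	}
	return groups
}

func main() {
	root := "."
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	sums, err := hashTree(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	for sum, paths := range duplicates(sums) {
		fmt.Println(sum)
		for _, p := range paths {
			fmt.Println("  " + p)
		}
	}
}
```

Splitting the walk from the grouping keeps the file-system-dependent part small, which is exactly the part that misbehaved on the experimental Windows port.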

So, I give up! At least, for now. I have begun using Go for serious work. I have not encountered any issues with the runtime so far. But ... that is in Linux. From a few small programs that I run in both Linux and Windows, I see that there are no issues with Go ... as long as OS interfaces are not used in Windows!

My choice of language for this particular job was wrong, leading to this fruitless effort. Sigh!

Modern computing

I came across this on the Go mailing list today. Irresistible!

Such is modern computing: everything simple is made too complicated because it's easy to fiddle with; everything complicated stays complicated because it's hard to fix.

— Rob Pike

2010-10-04

Time problems

Last week, I encountered two interesting problems.

Background: The product that I am building has a newly-developed licensing feature. The feature uses hardware information of the user's machine a la Windows activation. In addition, licenses are annual, so the program must commit suicide at the end of one year.

Problem 1: After the first successful run of the product, all attempts at running it again were failing with the error message `Current time older than that of the most recent run.' The time of last run is updated in the license file (which is an encrypted file) at the beginning of each run. After several attempts at debugging, I could find nothing. I switched to other work, and returned a few hours later. To my surprise, the program ran successfully, but just once! It then returned to the now-familiar error message.

I noticed that 4 hours had elapsed between the two successful runs. I suddenly remembered that I was storing the time of last run as UTC, and was signed in to my collaborator's computer in Toronto (UTC-4:00). Upon searching the code for all instances of storing/reading time, I noticed that conversions of local time to and from UTC were inconsistent. I made them consistent, and Problem 1 disappeared.
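
One way to keep such conversions consistent is to funnel every write and every read through a single pair of functions. The sketch below is in Go and is not the product's code; the function names and the RFC 3339 storage format are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// formatLastRun renders an instant in the stored form assumed by this
// sketch: always UTC, in RFC 3339 notation.
func formatLastRun(t time.Time) string {
	return t.UTC().Format(time.RFC3339)
}

// parseLastRun reads a stored timestamp back. The zone offset carried
// in the string is honoured, so the result denotes the same instant.
func parseLastRun(s string) (time.Time, error) {
	return time.Parse(time.RFC3339, s)
}

// normalize converts a timestamp written in any zone to the stored form.
func normalize(s string) (string, error) {
	t, err := parseLastRun(s)
	if err != nil {
		return "", err
	}
	return formatLastRun(t), nil
}

func main() {
	// 08:00 on a machine running at UTC-4:00.
	zone := time.FixedZone("UTC-4", -4*60*60)
	local := time.Date(2010, time.October, 1, 8, 0, 0, 0, zone)

	stored := formatLastRun(local)
	fmt.Println(stored) // 2010-10-01T12:00:00Z

	back, err := parseLastRun(stored)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Equal compares instants, not wall-clock fields.
	fmt.Println(back.Equal(local)) // true
}
```

With only one place that writes and one place that reads, a 4-hour skew like the one above has nowhere to creep in.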

Problem 2: No sooner had I finished resolving the above than I needed to run the program - in batch - about 6000 times. I wrote a shell script to do that. I tested it for 10 runs. Surprise! The error message was back, only even more confusingly. Jobs 1 and 2 ran successfully. Job 3 failed with the above error. Job 4 ran successfully. Job 5 failed. Jobs 6 and 7 ran successfully. Job 8 failed. Jobs 9 and 10 ran successfully. The actual jobs that ran successfully changed but slightly each time I re-ran the shell script. It really left me scratching my head.

I then noticed that there was a pattern. The jobs before the failed ones completed really quickly. I inserted a `sleep 1' after each job in the shell script. Now, every job completed successfully!

I then checked the license verifier code. One of the important checks was that the current time should be greater than that of the last run. However, the resolution of time was in seconds. Thus, if a job completed in less than a second, the job immediately after it would fail. So, I modified the condition to `current time should not be less than that of the last run'. The shell script ran happily ever after!
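
The corrected condition, sketched in Go. This is not the verifier's actual code: checkClock is a made-up name, and timestamps are assumed to be whole seconds since the Unix epoch.

```go
package main

import (
	"errors"
	"fmt"
)

// checkClock rejects a run only when the clock has gone backwards.
// Both arguments are whole seconds since the Unix epoch (an assumption
// of this sketch). The original, buggy condition was now <= lastRun,
// which also rejected a run starting in the same second as the last one.
func checkClock(lastRun, now int64) error {
	if now < lastRun {
		return errors.New("current time older than that of the most recent run")
	}
	return nil
}

func main() {
	last := int64(1285891200)
	fmt.Println(checkClock(last, last))   // same second: accepted
	fmt.Println(checkClock(last, last+1)) // later: accepted
	fmt.Println(checkClock(last, last-1)) // earlier: rejected
}
```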

2010-09-27

The competent programmer

The competent programmer is fully aware of the limited size of his own skull. He therefore approaches his task with full humility, and avoids clever tricks like the plague.

— Edsger W. Dijkstra

2010-09-23

Debugging a non-existent bug

I visited one of my clients last week, to find a tester of one of their products waiting with a critical bug report. The product homepage was giving rise to a warning `This site is potentially harmful to your computer. ...' I tried, and did not receive such a message. He walked away perplexed.

Five minutes later, he called saying that this `bug' was reproducible, but only on his machine. I asked him which browser he was using. ``IE 6," he responded. I asked him to check using Firefox, and report back.

He called right back. His voice sounded even more perplexed. ``Not reproducible using Firefox, sir," he said feebly.

``Yes, there is a bug," I retorted, ``not in your product, but in your browser!"

2010-09-21

Maturation

The process of maturation takes so long that when it does complete:

it is easy to overlook the total effort, since the effort spent at each juncture is little, and
it is easy to overlook its impact, since the learning appears `natural'.

2010-09-07

Four stages of learning

Acquiring a skill has, according to Indian tradition, four stages: adheeti, bOdha, aacharaNa and prachaaram.

adheeti (అధీతి) involves studying the subject with due attention.
bOdha (బోధ) involves distilling what is studied into the essential knowledge of the subject.
aacharaNa (ఆచరణ) is putting the knowledge to practice.
prachaaram (ప్రచారం) involves passing that practical knowledge to others.

2010-08-16

Short war story

One of my clients upgraded their network cabling over the weekend. This morning, all connection attempts to all Oracle database servers started taking too long to complete. The systems administrator called me in the afternoon, apparently after exhausting all possible solutions that he could think of.

I asked him if ssh to those database servers (all of which run Linux) was connecting immediately. He checked, and replied that it was. Initially, I too was confused. I then asked him if the same IP addresses (and not any other virtual addresses) were being used in tnsnames.ora on the application servers.

Then came the answer. He checked, and replied that tnsnames.ora did not have any IP addresses at all. It had only host names! The obvious culprit, then, was DNS. When the DNS server was restarted, the world returned to normality.

2010-08-10

Healthy projects

Most successful software projects that also balance the Cost-Time-Quality triangle seem to exhibit these properties:

Consistently high involvement of the users throughout the life cycle.
Simple designs. By simplicity, I mean most of the developers being able to hold the designs right in their heads.
Good integrators, who review code.