Last week, one of my clients' projects ran into a problem. The team had introduced a new feature in the latest release. It ran fine in testing, but began crashing mysteriously in production. After some investigation, they found that a particular third-party native library (a `.so` file) was causing the error. This native library was used to compute the InChI value of a given molecule. The application itself is a Java Web application written using the Struts framework.
Initially, there was a path-related problem. They solved it. Then there was a 32-bit vs. 64-bit problem (or so they thought). They solved it. Then there was some version mismatch. They solved it. At that point, they moved it into production. Soon after, the mysterious crashes started. Yesterday, I was called in.
Even before I landed there, my thinking was: (a) the molecules involved could be known corner cases of the third-party library, or (b) there was memory corruption (e.g. the JVM trying to garbage-collect memory allocated by the C runtime).
I began by understanding the issue's history. I familiarized myself with the part of the code that interfaced with the native code. Together with the team, I set up various scenarios to reproduce the crash. After a couple of hours, the first crash did occur. Using the trace information, we searched the forums of the third-party vendor who supplied the native library and its wrapper Java library. The forums had discussions concerning some of exactly the same scenarios that we had set up for testing. They were stated to be known problems. We also saw that a newer version of the vendor library was available. At this point, it appeared that my first guess was probably correct.
Nevertheless, re-running the scenarios with the updated library caused similar crashes! After a couple more hours, we appeared to have reached a dead end.
Then, I suddenly saw a pattern in the crashes -- they all happened only in multi-user scenarios! That reminded me of my second line of reasoning. It was memory corruption, but probably due to global data in the native library being shared by concurrent requests inside a single JVM! We quickly verified this by launching the same functionality simultaneously in independent JVM instances. There were no crashes!
So, that indeed was the problem! The native library was not thread-safe! We then rewrote that part of the application to factor the native part out into a separate process, run in a new JVM instance. Each client request invoking that particular functionality would run in its own JVM, so no two requests shared the library's global data, and would hand the results back to the parent program in a text file. With the thread-safety problem out of the way, the application went back to running happily!
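For the curious, here is a minimal sketch of what such a hand-off can look like. It is not the client's actual code. The class names `IsolatedNativeCall` and `InChIWorker`, the argument layout, and the timeout are all made up for illustration, and the real call into the vendor's wrapper library is replaced by a placeholder. It assumes Java 11 or later and that the worker class is on the same classpath as the parent.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Parent side: runs one native computation per freshly started JVM. */
class IsolatedNativeCall {

    /** Builds the command line that starts a new JVM running the worker class. */
    static List<String> buildCommand(String workerClass, String molecule, Path resultFile) {
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-cp");
        cmd.add(System.getProperty("java.class.path")); // reuse the parent's classpath
        cmd.add(workerClass);
        cmd.add(molecule);
        cmd.add(resultFile.toString());
        return cmd;
    }

    /** Starts the worker JVM, waits for it, and reads the result from the text file. */
    static String computeInChI(String workerClass, String molecule, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path resultFile = Files.createTempFile("inchi-", ".txt");
        try {
            Process child = new ProcessBuilder(buildCommand(workerClass, molecule, resultFile))
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // avoid blocking on a full pipe
                    .start();
            if (!child.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                child.destroyForcibly();
                throw new IOException("Worker JVM timed out");
            }
            if (child.exitValue() != 0) {
                // A crash in the native library only takes down the child JVM.
                throw new IOException("Worker JVM exited with code " + child.exitValue());
            }
            return Files.readString(resultFile, StandardCharsets.UTF_8).trim();
        } finally {
            Files.deleteIfExists(resultFile);
        }
    }
}

/** Child side (hypothetical): entry point that runs inside the new JVM. */
class InChIWorker {
    public static void main(String[] args) throws IOException {
        String molecule = args[0];
        Path resultFile = Path.of(args[1]);
        // Placeholder: the real call into the vendor's wrapper library would go here.
        String inchi = "PLACEHOLDER-FOR-" + molecule;
        Files.writeString(resultFile, inchi, StandardCharsets.UTF_8);
    }
}
```

The parent would call something like `IsolatedNativeCall.computeInChI("InChIWorker", molecule, 30)` once per request. Starting a JVM per request is not cheap, but each child gets its own copy of the native library's global data, and a native crash no longer brings down the Web application.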