Friday, February 13, 2015

What about testing like the shuttle?


Richard Feynman's appendix to the Challenger disaster report is an interesting read. The shuttle software process it describes is often taken as the best possible way to write quality code: the process you use when money is not as important as bug-free code. One cannot argue with the results, and that process got very close. What lessons should we take from it? This post will focus on the following two paragraphs in particular.

The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, etc. A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety, in a non-catastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.

To summarize then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history. This must be resisted for it does not appreciate the mutual subtle influences, and sources of error generated by even small changes of one part of a program on another. There are perpetual requests for changes as new payloads and new demands and modifications are suggested by the users. Changes are expensive because they require extensive testing. The proper way to save money is to curtail the number of requested changes, not the quality of testing for each.
One might add that the elaborate system could be very much improved by more modern hardware and programming techniques. Any outside competition would have all the advantages of starting over, and whether that is a good idea for NASA now should be carefully considered.

As noted, this process does not (did not? it isn't clear what process NASA now uses) take advantage of modern techniques. In particular there doesn't seem to be any automated testing. Of course the report was written in the 1980s, and I don't think anyone had considered the idea back then (at least not in the form I'm interested in), so we can't fault NASA for not using it. Today, however, we would want to automate those tests, which could save a significant amount of money over the years.
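
To make that concrete, here is a rough sketch of what one automated check might look like, written in Python. The orbit_insertion_burn routine and its numbers are hypothetical stand-ins, not anything from the shuttle's actual software; the point is simply that a check someone once walked through by hand can be re-run automatically on every change.

    import unittest

    # Hypothetical flight-software routine under test; the function and the
    # numbers are stand-ins for whatever module a release actually changes.
    def orbit_insertion_burn(current_velocity_mps, target_velocity_mps):
        """Return the delta-v (m/s) needed to reach the target velocity."""
        return target_velocity_mps - current_velocity_mps

    class OrbitInsertionBurnTest(unittest.TestCase):
        def test_burn_reaches_target_velocity(self):
            # A check that used to be a line item in a manual verification
            # procedure now runs on every change at no extra cost.
            self.assertAlmostEqual(orbit_insertion_burn(7600.0, 7800.0), 200.0)

        def test_no_burn_needed_when_already_at_target(self):
            self.assertEqual(orbit_insertion_burn(7800.0, 7800.0), 0.0)

    if __name__ == "__main__":
        unittest.main()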

For all the careful attention and analysis, they still had bugs. Nothing serious, fortunately, but enough to remind us the process isn't perfect either. Still, this is generally considered the process that has produced software with the fewest bugs. It is also expensive; very few of us could convince our boss to fund any software if it cost that much to build. Of course, most software failures won't become the defining story for an entire generation, so they don't demand that much attention. The cost issue is a business decision, but I'm not sure any business has done a cost/benefit analysis to decide whether it has the right compromise between quality and support. Here is the first lesson: the business people should make the decision on spending more money for quality versus more features.

Back to the description, one point really stands out: "There appears to be no process of gradually fooling oneself while degrading standards." If this is true, it would be one of the greatest achievements of any management anywhere. It is easy to fool yourself into thinking what you just did was great when it is not. As a programmer, the most common way I've seen of fooling yourself is never exercising the combinations that don't work right: always using the keyboard shortcut because the buttons are not reliable, or always turning on feature A before feature B because the opposite order doesn't work even though it should.

The attitude toward test failures is interesting: "A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future." Treating a test failure as an embarrassment is unusual in the TDD world, where all tests are written first. This may in fact be a fault of NASA's system: if you don't test your tests, how do you know they can detect a failure? TDD starts with a red test, so it is clear that (at least originally) the test could detect a failure.
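
For contrast, here is a minimal sketch of that red-first rhythm, using a hypothetical frame checksum as the example: the test is written and run before the code exists, fails for the right reason, and only then is the code written to make it pass.

    import unittest

    # Step 1 (red): write the test first. Running it before the code exists
    # fails, which proves the test can actually detect a missing or broken
    # implementation.
    class ChecksumTest(unittest.TestCase):
        def test_checksum_of_known_frame(self):
            self.assertEqual(checksum(b"\x01\x02\x03"), 6)

    # Step 2 (green): now write the simplest code that makes the test pass.
    # (A hypothetical example; a real protocol checksum would be more involved.)
    def checksum(frame):
        return sum(frame) % 256

    if __name__ == "__main__":
        unittest.main()

Run it with the checksum function commented out and the test goes red; put the function back and it goes green, which is the whole point of writing the test first.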

However, that might be looking at testing the wrong way. NASA probably was doing some form of developer unit testing, and those tests probably did fail without triggering a serious origin study. It seems likely that NASA expects its developers not to consider code done until they are sure it is perfect. In that context their testing makes more sense: "The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety." Despite the agile hype that dedicated testers are not required because everyone is a tester, I don't buy it; manual verification by a third person is valuable. In Uncle Bob's talk about "the reasonable expectations of your CEO" he says "I want the testers to wonder why they have a job: they never find anything." NASA seems to have captured this long ago. Testers do a lot of work, but should never find anything.

In fact there is a separate test group at NASA: "But completely independently there is an independent verification group." This group seems to be the one running the tests that shouldn't fail, and indeed "A failure here generates considerable concern." I would like more detail on what considerable concern means. In the absence of that knowledge, I'm willing to speculate on something that could be useful: 5 whys is a great technique for getting to the bottom of any issue, asking "why?" repeatedly until you reach the underlying cause rather than the surface symptom. Imagine if every issue the testers found generated a 5 whys response and got its root cause fixed.

There is a downside to all this, though: from personal conversations with programmers who have worked under similar processes, I've been told it is not fun to work that way. The amount of time spent in review and process is boring and demotivating. The "cowboy programmer" has a lot more fun. To be fair, even while cowboy programmers enjoy their jobs, most of them are well aware that the quality of the software they produce is not acceptable.

What we really want is the best of both worlds: high-quality software and the fun "cowboy programmer" experience. If we can improve on NASA's quality, that would be nice as well, but let's not fool ourselves into thinking that is easy. NASA managed to set a high bar without many of the modern techniques we have now.

So back to the question: should we learn anything from NASA, or is its process obsolete? The answer is that we still have things to learn. First, don't fool yourself. Second, don't release software to test if there are any known bugs, and if any bugs find their way to test anyway, treat them as a serious issue. Third, management should consider the tradeoffs and decide how much quality to pay for. If you make these your goals and work to make them realistic, it will go a long way toward increasing the quality of your code.