Friday, March 6, 2015

Quality over time.

The post on testing like the shuttle noted that the software used on the shuttle was old. That raises a question: is quality just a matter of time - maintain software for years and eventually you fix most of the bugs? Every year that passes is another year where programmers look at the code again and fix something "odd". Every month is one more chance to finally figure out that rare crash. Does this mean that over time there is less bad code?

Unfortunately, getting data is hard: most companies are secretive about their processes. Even when they are not, it is hard to do real science when you can't even get an apples-to-apples comparison, much less a true control case. We can look at open source for some hints, but we need to be careful not to put too much weight on those hints as though they were proof.

Some large open source projects have found that time works. Linux, Apache, and many others have a reputation for quality. When a few dedicated programmers give years of their lives to the code, and are allowed enough time to make things better rather than just fixing critical bugs or adding new features, the code quality improves. Those programmers come to know the entire system well enough that they are not afraid to make major changes for the better. If they are guided by tests (which these projects are moving toward, though in general they don't have many yet) there is a high likelihood that those changes won't introduce new bugs. In the open source world these people are generally the maintainers, and they have veto power over all contributions. When the right programmer is given the power to refuse new money-making features, the quality of the project does increase.

The open source model has failures as well. OpenSSL, for example, has been around for a while and has been in the news recently for some serious bugs. The maintainers don't seem to prioritize quality, and so they have critical bugs that get worldwide attention. The project has implied that the money funding it comes from people who want features and don't care about bug-free code, and so major bugs are not a priority.

From what I can tell, the commercial world is a mix as well. On the one hand, the pressures of money and schedule mean that companies often do not give programmers time to make needed improvements. On the other hand, many companies have learned the hard way that poor code quality is expensive too. Companies normally say they want to strike a balance: if they don't limit the engineers nothing will get done and the company goes out of business, but if they put nothing into maintainability the software ends up too expensive to maintain.

Balance is a good idea. However, in my experience it often gets lip service without becoming reality. It is easy to say "quality is important", and then make day-to-day decisions that prove quality isn't important. There are always competitive pressures that make management want the current work done "quick and dirty" now, with a promise to come back "tomorrow" and clean it up. Then when tomorrow comes, the next feature is more important. The solution here is simple: management needs to stop manufacturing emergencies out of normal situations. Software is always late; quit pretending that is abnormal and manage it.

Even when management isn't directly standing in the way, programmers don't always make things better; instead they put a "band-aid" on, which leaves the whole uglier but gets the job done. Over time this causes the code to decay rather than improve. Even when the programmer cares, anything but the most trivial change is hard to get in, because of the (legitimate) worry that the change might break something else.

There is one other missing piece: many companies want to treat people like interchangeable parts - hire some contractors when there is work, switch people to a different project when the work is done. When you don't really understand the code in question you can't improve it: programmers who don't understand the big picture are likely to destroy a good design they don't understand in their misguided efforts to make things better.

The result of this, over years: eventually the company gives up and does an expensive big rewrite to "fix all the problems".

The question then becomes what is better long term: to start over from scratch every 10 years or so when the code becomes unmaintainable, or to spend extra time and money over those 10 years to keep the code maintainable. The big rewrite is expensive: not only do you pay most of the cost of writing the old system again at tomorrow's post-inflation prices, but you also have to pay to maintain the old system until you can switch to the new one. Don't forget that the old system has become hard to maintain, so you are paying extra money just for the work it needs. Or you could instead invest a little extra money every year in improving quality, so that things never reach the point where you need the big rewrite.

That might make it sound obvious that spending money on quality is the better idea, but this may not be true. Car manufacturers tweak their cars every year, but they still have teams start over to do ground-up redesigns, because there is only so far tweaks can take them. Those ground-up redesigns allow them to take advantage of new processes, styles, and materials that are not possible within the design of the existing car. Likewise in programming, starting from the ground up is the only way to rethink many early core decisions. The important point is not to keep quality up so that you never have to start over, but that starting over should be a business decision made in advance.

When you read the above you are probably thinking of Joel on Software's advice to never rewrite your software. However, his advice is flawed. What is missing from it is that if you choose to rewrite, you need to commit to two fully funded teams for several years: one continuing with the current code to keep it competitive in the market, the other doing the rewrite. If you cannot afford this very expensive cost, then a rewrite is not for you. If you can afford it, it might be right. Over the very long run the rewrite can make you more money than sticking with the current code would. Where Netscape failed was not in deciding to do a rewrite; it was in not funding maintenance of the old code until the new was ready. Firefox is now a dominant browser, so while the full rewrite cost Netscape the company, the rewrite did well over the long term, and it seems likely that the browser code they had before the rewrite would not have gotten them here. (Though one wonders if the situation would be different had Microsoft not also abandoned their browser for many years, giving Firefox time to get ahead.)

Which model is right for your project? Are you going to maintain the same software year after year, slowly adding features as required, or are you going to start over every once in a while, creating a system that is suddenly better and lets you add more features fast for a time until it decays to an unmaintainable mess?

If you choose the first option, you have to accept that in the short run you may be late with useful new features simply because they don't fit your quality standards without significant technical work. It also means you keep a few people for many years, always working on the same code. They need plenty of time for technical improvement. Those people need a passion for quality, and the authority to make decisions that are wrong in the short term for the long-term good.

If you choose the second, you can get to market faster in the short term: make it work now. Avoid refactoring; only make things better when there is no choice. The only parts of your code that are maintainable are the parts that change so often you are forced to make them maintainable. As you go along everyone notes the mistakes they made, and the next time around you avoid them. The team making the next version then starts with a design that avoids those mistakes. I will add a caveat: guard against the second-system effect.

There is one tricky part of the second plan that you need to handle carefully: the transition plan. Since you are planning on giving up on the current software, you need to ensure that when you write the new software your users can transfer over to it. Therefore your data storage formats need to be carefully documented, and you need to test that your data actually follows the format.

This isn't actually an either-or choice. Just as car makers generally reuse the same engines, you can choose to keep old parts of the system that are working well. In fact, in every "ground-up redesign" I've worked on, some parts of the old system were kept. When you have complex business logic that works, it is often better to reuse it, wrapping it in a new UI and fixing the foundational architecture that is broken. The parts you aim to keep need to be kept maintainable, but not the whole.

Even if you choose the first, all is not well. We have not yet figured out the best way to program. If you had started writing a program in 1955 with maintainability in mind, you would have real problems, because in 1955 we didn't have any useful programming languages (see Wikipedia's timeline of programming languages). If you had started a few years later, COBOL probably seemed like a good language, but today we would disagree.

Choice of language isn't the only issue; we have also learned things about writing good programs. Structured programming solved a lot of major problems that in 1955 we didn't anticipate. Object-oriented programming came next because structured code isn't enough. There are some who believe that functional programming will be next, though it hasn't caught on yet. Our current best practices leave much to be desired, but we don't yet know of anything better.

Then there are project-specific issues. Early design decisions often prove not to scale for some unanticipated need years later. When those decisions are at the core of how your software is designed, there is no way to retrofit them without months or even years during which the project is not shippable. It is often better to start over if you find yourself in this situation.

History says that you will still need to give up and rewrite someday anyway. However, with some care you can skip a rewrite and save money overall. In some cases you can put it off until your project is obsolete and no longer needed.

Which is right for your project? That is your choice. Only you know the pressures and long-term situation that apply to your project.

Friday, February 13, 2015

What about testing like the shuttle?


Richard Feynman's appendix to the Challenger disaster report is an interesting read. It is often taken as describing the best possible way to write quality code: the process you use when money is not as important as bug-free code. One cannot argue with results, and that process got very close to bug-free. What lessons should we take from their process? This post will focus on the following paragraphs in particular.

The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, etc. A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety, in a non-catastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.

To summarize then, the computer software checking system and attitude is of the highest quality. There appears to be no process of gradually fooling oneself while degrading standards so characteristic of the Solid Rocket Booster or Space Shuttle Main Engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in Shuttle history. This must be resisted for it does not appreciate the mutual subtle influences, and sources of error generated by even small changes of one part of a program on another. There are perpetual requests for changes as new payloads and new demands and modifications are suggested by the users. Changes are expensive because they require extensive testing. The proper way to save money is to curtail the number of requested changes, not the quality of testing for each.
One might add that the elaborate system could be very much improved by more modern hardware and programming techniques. Any outside competition would have all the advantages of starting over, and whether that is a good idea for NASA now should be carefully considered.

As noted, this process does not (did not? it isn't clear what process NASA uses now) take advantage of modern techniques. In particular, there doesn't seem to be any automated testing. Of course the report was written in the 1980s; I don't think anyone had considered the idea back then (at least not in the form I'm interested in), so we can't fault NASA for not using it. However, today we would want to automate the tests. This could save a significant amount of money over the years.

For all the careful attention and analysis, they still had bugs. Nothing serious, fortunately, but enough to remind us the process isn't perfect either. Still, this is generally considered the process that produced software with the fewest bugs. It is also expensive: very few of us could convince our boss to write any software at all if we had to spend that much money to get it done. Of course, most software failures won't become the defining story for an entire generation, so they don't demand that much attention. This cost issue is a business decision, but I'm not sure any business has done a cost/benefit analysis to decide whether it has the right compromise between quality and support. Here is the first lesson: the business people should make the decision on spending more money for quality versus more features.

Back to the description, one point really stands out: "There appears to be no process of gradually fooling oneself while degrading standards." If this is true, it would be one of the greatest achievements of any management anywhere. It is easy to fool yourself into thinking what you just did was great when it is not. As a programmer, the most common way I've seen of fooling yourself is never using the combinations that don't work right: always using a keyboard shortcut because the buttons are not reliable, or always turning on feature A before feature B because the opposite order doesn't work even though it should.

The attitude toward test failures is interesting: "A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future." This focus on test failures as embarrassing is unusual in the TDD world, where all tests are written first. It may in fact be a fault of NASA's system: if you don't test your tests, how do you know they can detect a failure? TDD starts with a red test, and so it is clear that (at least originally) the test can detect a failure.

However, that might be looking at testing wrong. NASA probably was doing some form of developer unit testing, and those tests probably did fail without a serious origin study. It seems likely that NASA expects its developers not to consider code done until they are sure it is perfect. In that context their testing makes more sense: "The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety." Despite the agile hype that dedicated testers are not required because everyone is a tester, I don't buy it: manual verification by a third person is valuable. In Uncle Bob's talk about "the reasonable expectations of your CEO" he says "I want the testers to wonder why they have a job: they never find anything." NASA seems to have captured this long ago. Testers do a lot of work, but should never find anything.

In fact there is a separate test group at NASA: "But completely independently there is an independent verification group." This group seems to be the one doing those tests that shouldn't fail. And indeed: "A failure here generates considerable concern." I would like more detail on what considerable concern means. In the absence of knowledge, I'm willing to speculate on one thing that could be useful: 5 whys is a great technique for getting to the bottom of any issue. Imagine if every issue the testers found generated a 5 whys response and got a root cause fixed.

There is a downside to all this, though. From personal conversations with programmers who have worked under similar processes, I've been told it is not fun to work that way. The amount of time spent in review and process is boring and demotivating. The "cowboy programmer" has a lot more fun. To be fair, even while the "cowboy programmer" enjoys his job, most of them are well aware that the quality of software they produce is not acceptable.

What we really want is the best of both worlds: high quality software and the fun "cowboy programmer" experience. If we can improve on NASA's quality that would be nice as well, but let's not fool ourselves into thinking that is easy. NASA managed to set a high bar without a lot of the modern techniques we know now.

So back to the question: should we learn anything from NASA, or is their process obsolete? The answer is that we have things to learn. First, don't fool yourself. Second, don't release software to test with known bugs, and if any find their way to test anyway, treat it as a serious issue. Third, management should consider the tradeoffs and decide how much quality to pay for. If you make these your goals and work to make them realistic, they will go a long way toward increasing the quality of your code.

Wednesday, January 21, 2015

Can we take lessons from mechanical engineering?

I have to preface this with the note that I'm not a mechanical engineer. I know a little about the subject, but not a whole lot. If you are a real mechanical engineer, please leave comments on what I have wrong. Otherwise take my illustration with a grain of salt.
 
Figuring out which seams to test at is hard on a complex project. Testing the completed project is too expensive, but if you test less than the full product you risk that your tests don't reflect how the parts actually work together. Can we look to other fields for help? To answer this question I looked a little at mechanical engineering to see what they do.

Probably the best description of building complex mechanical systems is from Richard Feynman's appendix to the Challenger disaster report:

The usual way that such engines are designed (for military or civilian aircraft) may be called the component system, or bottom-up design. First it is necessary to thoroughly understand the properties and limitations of the materials to be used (for turbine blades, for example), and tests are begun in experimental rigs to determine those. With this knowledge larger component parts (such as bearings) are designed and tested individually. As deficiencies and design errors are noted they are corrected and verified with further testing. Since one tests only parts at a time these tests and modifications are not overly expensive. Finally one works up to the final design of the entire engine, to the necessary specifications. There is a good chance, by this time that the engine will generally succeed, or that any failures are easily isolated and analyzed because the failure modes, limitations of materials, etc., are so well understood. There is a very good chance that the modifications to the engine to get around the final difficulties are not very hard to make, for most of the serious problems have already been discovered and dealt with in the earlier, less expensive, stages of the process.

There is of course more to design than can be expressed in one paragraph, but that gives us something to work with.

The first tests mechanical engineers run are on the properties of the material in question. In programming we start with standard libraries and algorithms and test them in isolation. We often test situations that cannot happen in the real world. We document the limitations. Unit tests do this very well. If you switch a turbine blade from zinc to aluminum, for the most part the same tests will pass, but some edge cases (maximum RPM before it explodes) will change, and we can predict with reasonable reliability what those changes are. Likewise, when you switch from bubble sort to merge sort most things will work, but some edge cases (is the sort stable?) will change.
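To make that concrete, here is a minimal sketch - plain C++ with asserts, with an invented record type and data - of a test pinned to exactly that kind of edge case: it passes with a stable sort, and is no longer guaranteed to pass if an unstable sort is swapped in.

#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct Employee {
    std::string name;
    int department;
};

int main() {
    std::vector<Employee> staff = {
        {"Alice", 2}, {"Bob", 1}, {"Carol", 2}, {"Dave", 1}};

    // Sort by department only; ties keep their original relative order
    // because the algorithm is stable.
    std::stable_sort(staff.begin(), staff.end(),
                     [](const Employee& a, const Employee& b) {
                         return a.department < b.department;
                     });

    // "Most things work" with any sort: departments end up in order.
    assert(staff[0].department == 1 && staff[3].department == 2);

    // The edge case: Bob came before Dave in the input, so a stable sort must
    // keep that order. Swap in std::sort above and this property is no longer
    // guaranteed - exactly the kind of documented limitation described above.
    assert(staff[0].name == "Bob" && staff[1].name == "Dave");
    return 0;
}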

One digression needs to be made here: mechanical engineering has safety margins. Material tests are generally for an ideal case of the material as formed in a laboratory. In the real world of manufacturing the same quality might not be obtainable - for example, there might be "bubbles" or "cracks" in the real part (for materials prone to this, that is part of the list of properties). Some of this is solved by inspecting parts in manufacturing, but some of it is solved by specifying parts that in theory can handle more stress than they will actually see. The margin is adjusted depending on needs: large expensive parts will in general have minimal margins, while parts that could kill someone will often be three times "stronger" than required, just in case. I don't know of an analog for programming.

From there they create larger components, and assemble those into larger and larger assemblies. Programs are also made of components, which are then put together into larger and larger parts. In both cases you eventually arrive at a whole. So long as only minor changes are required there is no problem: as long as the parts still fit together you can change the pieces inside, and you only need to retest that part and the parts above it.

The most important lesson is that if your change to one part changes the way it connects to the next component, you also need to redesign that component to fit, which means a lot of parts need to be retested. Mechanical engineers have a concept of interchangeable parts. I upgraded the clutch on my car with one from a model 4 years older built for a different engine: the upgrade bolted right to my engine and transmission, so the change was fairly simple even though large parts of my car had been redesigned.

In programming we have figured out how to substitute simple algorithms like sort and a few containers. However, we don't yet have a general concept of how to substitute larger pieces. There are plug-in architectures, but they are plug-in only at a few levels. You can run many different programs, but the programs themselves are generally either monolithic or only allow a few areas of change. We can trade out device drivers and a few other areas, but those tend to be places where our program touches something external. What is your "clutch" - a part of your program that can nonetheless be traded out? I'm not sure this is actually a valid question: my son's toys generally have one changeable part, the battery. Maybe most software only has a few plug-ins because changing plug-ins is not useful for most software.
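For what it's worth, the closest thing I know of to a software "clutch" is an interface agreed on in advance - the bolt pattern, so to speak. Below is a hedged sketch with invented names (not from any particular project): any backend that satisfies the interface can be swapped in without the rest of the program changing.

#include <iostream>
#include <map>
#include <memory>
#include <string>

// The "bolt pattern": the rest of the program only ever talks to this.
class Storage {
public:
    virtual ~Storage() = default;
    virtual void save(const std::string& key, const std::string& value) = 0;
    virtual std::string load(const std::string& key) const = 0;
};

// One interchangeable part. A file-backed or database-backed implementation
// could bolt onto the same interface later without touching callers.
class InMemoryStorage : public Storage {
public:
    void save(const std::string& key, const std::string& value) override {
        data_[key] = value;
    }
    std::string load(const std::string& key) const override {
        auto it = data_.find(key);
        return it == data_.end() ? "" : it->second;
    }
private:
    std::map<std::string, std::string> data_;
};

int main() {
    std::unique_ptr<Storage> storage = std::make_unique<InMemoryStorage>();
    storage->save("high-score", "300");
    std::cout << storage->load("high-score") << "\n";  // prints 300
    return 0;
}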

Let's go back to Richard Feynman's description. I have a problem with his approach. He says you start with a turbine blade, but why are we using a turbine in the first place? Those model rockets I made as a kid didn't have one. Of course my model rocket used a different fuel source, but that is the point: before you can decide you need a turbine at all, you need to know what fuel you will use, and that is top-down design, not bottom-up. You need to know a lot of your design beforehand: a turbine that won't fit is a design disaster even if it works well in isolation. This is exactly the problem I'm facing in programming: eventually, in a large system, you will design two components that don't quite fit together right, and fixing that means major work on one or both parts.

In conclusion, mechanical engineering appears to face many of the same problems we do. There is no silver bullet: engineering is a hard problem and you have to make compromises, refine guesses, and so on - until you get something that works.

Tuesday, January 6, 2015

Are integration tests the answer?

I have been picking on unit tests a lot lately. The obvious question is: do I think integration testing is the answer?

Before I can answer this, we need a definition of integration test. Just like unit test, the definition of integration test goes back to long before automated tests. An integration test is any test that combines two or more units in a single test, with the purpose of testing the interaction between those units. Many other authors have attempted to redefine integration test into something that makes more sense in an automated-test world.

Back to the question: what about more integration tests? The standard answer is no: when an integration test fails there are many possible reasons, which means you waste time trying to figure out where things broke. It is generally agreed that when a test fails you should know exactly what part of the code broke. Since an integration test covers a large part of the code, the failure could be anywhere.

I question that answer. Sure, in theory the code could break for many reasons. However, in the real world there is exactly one reason a test failed: the code you touched in the last minute broke something. The point of automated tests is that we run them all the time - several times a minute is the goal, and once a minute is common. Even the fastest typist cannot write much code in a minute, which leaves a tiny number of places to look for the failure. If a large integration test breaks, you already have the root cause isolated to a couple of lines of code. As a bonus, that area is in your short-term memory! (Sometimes the solution is changing code elsewhere, which is hard, but where the problem was introduced is obvious.)

Unfortunately there are other problems with integration tests that are also used as reasons not to write them. These reasons are valid, and you need to understand the tradeoffs in detail before you write any tests.

The first problem with integration tests is that they tend to run long. If you cannot run your entire test suite in 10 seconds (or less!) you need to do something. I might write about this later, but here is a short (probably incomplete) summary of things to do. Use your build system to run only the tests that actually cover the code you changed. Profile your tests and make them run faster. Split them into suites that can run in parallel. Split them into parts on a schedule where some run all the time and some less often. Use these tricks to get your test times down. There is one more option that deserves discussion: test smaller areas of code. This can get your test times down - at the expense of all the problems of unit tests.

A second problem is that integration tests are fragile because of time. Time is easy to isolate in units - most of which don't use time anyway - but the larger the integration, the more likely it is that something will fail because of timing issues. I outlined a potential solution in my testing timers post, but it may not work for you.
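One common approach - a sketch under my own assumptions, not necessarily what that post describes - is to inject a clock that the test controls, so time only moves when the test says so. The Clock, FakeClock, and Session names here are invented for illustration.

#include <cassert>
#include <chrono>

class Clock {
public:
    virtual ~Clock() = default;
    virtual std::chrono::steady_clock::time_point now() const = 0;
};

// Production implementation: just asks the operating system.
class SystemClock : public Clock {
public:
    std::chrono::steady_clock::time_point now() const override {
        return std::chrono::steady_clock::now();
    }
};

// Test double: time only advances when the test says so, so the test never
// flakes because the build machine was slow.
class FakeClock : public Clock {
public:
    std::chrono::steady_clock::time_point now() const override { return now_; }
    void advance(std::chrono::seconds s) { now_ += s; }
private:
    std::chrono::steady_clock::time_point now_{};
};

// Code under test: a session that expires after 30 minutes of inactivity.
class Session {
public:
    explicit Session(const Clock& clock)
        : clock_(clock), last_activity_(clock.now()) {}
    bool expired() const {
        return clock_.now() - last_activity_ > std::chrono::minutes(30);
    }
private:
    const Clock& clock_;
    std::chrono::steady_clock::time_point last_activity_;
};

int main() {
    FakeClock clock;
    Session session(clock);
    assert(!session.expired());
    clock.advance(std::chrono::seconds(31 * 60));  // 31 minutes pass instantly
    assert(session.expired());
    return 0;
}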

Requirements change over time. This change is more likely to hit integration tests because you are testing a large part of the system. When your tests are tiny, there are correspondingly only a few ways each test can break. Larger tests have a larger surface to break when something changes, so integration tests are more subject to change. This is not always bad: sometimes a new feature cannot work with some existing feature, and the failing test is the first time anyone realizes the subtle reason why. Failing tests are often a sign that you need a conversation with the business analysts to understand what should happen.

An important variation of the above: the user interface is likely to change often. Once you have a feature working you are unlikely to change the code behind it, but the UI for the feature not only has to let you use the feature, it also needs to look nice. Looking nice is a matter of subjective style, which changes over time. If your tests all depend on looking for a particular shade of yellow, then every time tastes change a bunch of tests need to change. A partial solution is to use a UI test framework that knows about the objects on the screen instead of looking for pixel positions and colors. The objects will change much less often, but even this isn't enough: a widget might move to a whole new screen, which again can break a lot of tests.

Fragile can also mean the integration test doesn't inject test doubles like a unit test would. It is very useful to write such tests: the only way to know you are using an API correctly is to use it. However, any time a real API is used instead of a test double you take the risk that something real might or might not be there and break the test. A test that needs a special environment can be useful to ensure that your code works in that environment, but it also means anyone without that environment cannot run the test. This is a tradeoff you need to evaluate yourself.

Perhaps the biggest problem with integration tests is that sometimes you know an error is possible, but not how to create it. For example, filling your disk with data just to test disk-full errors isn't very friendly. Worse, there may not be a way to ensure the disk fills up at the right time, leaving some error paths untested. This is only one of the many possible errors you can get in a real-world system that you need to handle, but probably cannot simulate well without test doubles (a sketch of this follows below).
The above is probably not even a complete list.
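Coming back to the disk-full example, this is where a test double earns its keep. Below is a hedged sketch, with an invented interface and names, of injecting a writer that fails on demand so the error path runs at exactly the moment the test chooses.

#include <cassert>
#include <string>

class FileWriter {
public:
    virtual ~FileWriter() = default;
    // Returns false on failure (e.g. out of space).
    virtual bool write(const std::string& data) = 0;
};

// Double that fails after a configurable number of writes, so "disk full"
// can be triggered at exactly the point the test cares about.
class FailingFileWriter : public FileWriter {
public:
    explicit FailingFileWriter(int writes_before_failure)
        : remaining_(writes_before_failure) {}
    bool write(const std::string&) override { return remaining_-- > 0; }
private:
    int remaining_;
};

// Code under test: must report failure to the caller rather than silently
// dropping data.
bool save_report(FileWriter& writer, const std::string& header,
                 const std::string& body) {
    return writer.write(header) && writer.write(body);
}

int main() {
    FailingFileWriter disk_full_after_header(1);
    // The second write fails, as if the disk filled up mid-save.
    assert(!save_report(disk_full_after_header, "header", "body"));
    return 0;
}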

So do I think you should write integration tests? No, but only because the definition is too tied to the manual-testing/unit-testing world. What we need is something like integration tests, but without the above problems. There is no silver bullet here: at some point, no matter which testing philosophy you use, you will hit a limit where you need to examine tradeoffs and decide what price to pay.

Monday, December 15, 2014

Don't automate your unit tests

I first started doing unit tests about 10 years before I heard about the idea of automated tests. It was understood that unit tests were the programmer testing small sections of code in isolation. Who better to test that all the logic works than the programmer who did the work? You wrote a test program to give each function the input you cared about, and inspected whether the state of the system was what you expected. It worked great; I remember finding many subtle bugs in code that I thought was right until I noticed the internal state of my system wasn't right. Once you were done with each test you modified your fixture to test the next input, without saving your test program. When you finished, you deleted the test program.
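A throwaway driver of that sort might have looked something like the sketch below - an invented example, not code from any real project. You would eyeball the output, tweak the inputs for the next case, and eventually delete the file.

#include <iostream>
#include <map>
#include <string>

// Hypothetical module under test, stubbed inline so the sketch compiles.
class Inventory {
public:
    void add_item(const std::string& name, int count) { counts_[name] += count; }
    void remove_item(const std::string& name, int count) { counts_[name] -= count; }
    int debug_count(const std::string& name) const {
        auto it = counts_.find(name);
        return it == counts_.end() ? 0 : it->second;
    }
private:
    std::map<std::string, int> counts_;
};

int main() {
    Inventory inv;
    inv.add_item("widget", 3);
    inv.remove_item("widget", 1);
    // Inspect internal state by hand, the white-box way. Expected: 2.
    std::cout << "count after remove: " << inv.debug_count("widget") << "\n";
    // Edit the inputs for the next case, re-run, delete the file when done.
    return 0;
}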

The above was often called white box testing, as opposed to black box testing. In white box testing the person doing the testing knows how the system is implemented and takes advantage of that fact to write tests. By contrast, in black box testing the person doing the testing doesn't know the implementation; they are supposed to verify that it works as it should. Having both types of manual tests was considered best practice.

If you are switching from the above to automated tests it might seem easy and obvious that you should take the tests you wrote previously and save them. It turns out that this is actually wrong. Throwing away those tests was the right thing to do. Even though they found many bugs, they are not the types of tests you want long term. As I wrote previously, the reason we automate tests is to support future changes to the code. Unit tests don't support the future of the code; they verify its current state. This subtle distinction means that those old unit tests are not useful.

The future of your code is change: new features, bug fixes, performance improvements, and refactoring. Each will change the internal state of your program and thus break your unit tests. If those changes break your automated tests, the only thing to do is throw the tests out and write new ones. Even if you start with the old tests and make minor tweaks, you should still consider them brand new tests. This is not what you want. When you fix a bug you should add new tests but otherwise not change any tests. When you refactor or make a performance improvement, there should be zero tests that change. When you add a feature you might find a couple of tests that conflict with the new feature and get deleted, but overall you should be adding tests. Unit tests do none of the above: because unit tests know the internal state of your system, they are the wrong answer for the future.

This does not mean unit tests are useless. I often insert printfs or debugger breakpoints into my code while trying to understand a bug. Many languages have some form of ASSERT() which will abort the program if some condition fails. Those are all forms of unit tests. However, the fact that I know some details as they are now doesn't mean I should test those details automatically.

When writing automated tests you should take advantage of your knowledge of the system to ensure that all of the internal states are exercised. However, you must not actually check the internal state of the system to ensure it is correct. Instead you need to figure out what the external state of the system should be, given that internal state. Often this means you combine what would have been several different unit tests into one test.
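As a small illustration of the distinction (invented names, not from any real project): the white-box knowledge that the container below grows and rehashes as it fills decides which inputs the test uses, but the assertion is only about externally visible behavior, so a refactor of the internals leaves the test untouched.

#include <cassert>
#include <string>
#include <unordered_map>

class PhoneBook {
public:
    void add(const std::string& name, const std::string& number) {
        entries_[name] = number;
    }
    std::string lookup(const std::string& name) const {
        auto it = entries_.find(name);
        return it == entries_.end() ? "" : it->second;
    }
private:
    std::unordered_map<std::string, std::string> entries_;
};

int main() {
    PhoneBook book;
    // Internal knowledge: add enough entries to force the container to grow.
    for (int i = 0; i < 100; ++i) {
        book.add("person" + std::to_string(i), std::to_string(i));
    }
    // External check: every entry is still retrievable afterwards. Swapping
    // the container or its growth policy leaves this test unchanged.
    for (int i = 0; i < 100; ++i) {
        assert(book.lookup("person" + std::to_string(i)) == std::to_string(i));
    }
    return 0;
}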

Is there an exception to this rule? Code to prevent a cross-thread race condition cannot be tested by external behavior, but it still needs to work. You can try running the test a million times, but even that isn't guaranteed to hit the race condition once. Still, I hesitate to allow testing internal state even in cases like this: the locks required to make the code thread safe should not be held longer than required, and if a lock remains held long enough for your test to check that you are holding it, it is held too long for good performance.

Friday, December 5, 2014

Tests and Architecture

This post is a follow-up to my previous post on testing implementation details.

I believe that when you advocate for something you should take extra care to ensure that you are not blinding yourself to the downsides. Since I'm a proponent of testing software, and of TDD as the way to do it, it is good to ask: what are the downsides?

There is one obvious downside that I'm aware of: tests do not drive good architecture. Even though TDD is sometimes called Test Driven Design (Test Driven Development is also used), TDD doesn't actually drive good design! Sure, it works well for tiny programs. However, I have found that for large programs, where many programmers (more than 10) are working full time, you cannot rely on TDD alone to get good architecture: you need something more.

In the well-known TDD bowling game example the authors talk about a list of frame objects that they expected to need in the early stages. However, as their program matured over the exercise they never created one. Many people have done the same example, with the same tests, and ended up with a list of frames. Which is right? For the requirements given, both are equally correct. However, the requirements are such that one pair could write the entire program in a couple of hours.

So let's imagine what a real bowling program would look like if we worked for Brunswick. If you haven't been bowling in a few years you should go to their website and investigate a modern bowling center. Bowling is not just about keeping score: the system needs to track which pins are still standing, and the clerk needs to turn lanes on and off when they are paid for. There is a terminal for interaction (entering bowlers' names), and a second screen that displays the score, or funky animations when a strike is bowled. There is the pin-setting machine, which we can assume has a computer of its own. There are probably a number of other requirements for different events that aren't obvious from the website but are important anyway. The need to add these features places requirements on the architecture that do not exist in the example. Do any of these requirements demand a list of frames?

Now many of you are thinking YAGNI, but that is only partially correct. Even its most extreme proponents will tell you that YAGNI cannot work without continuous refactoring. You can look at something that is done, ask "is this right?", and redesign the architecture to fit. The entire cycle is only a few hours long: the full sequence of steps ends with refactor, and part of refactoring is creating good architecture. When you have something that works you can step back and ask "now that I have experience, what is a good design?" instead of trying to guess. This is great for getting the classes right in the small, much better than up-front design. However, I contend that YAGNI alone isn't enough for a large project.

YAGNI hits a wall on large projects. At some point you will realize someone on the other side of the office (often the world!) did something similar to something you have also done; and both of you did it a few months back and have been using your versions all over. The right thing to do is refactor to remove the duplication, but how do you do that when the APIs are slightly different and both are used everywhere? Creating modules is easy if the API in question was identified correctly in advance, but if you see the need late it can be difficult. Tests do not help you create the right modules in advance. Tests can help you add those modules later when you find a need, but they often will actually hinder adding them.

So the only question is how do you identify those modules in advance? Fortunately for me, this is a blog on testing. Therefore I can leave the discussion of how to figure out where you need those modules to someone else...

Unfortunately I can't just leave you with that cop-out; nobody sensible would accept it. I'm also out of my league here. There are the obvious places where a boundary exists, but those are also easy and probably already done. You would be a fool to write your own string implementation, for example (I'm assuming your project isn't writing the standard library for your language). There are somewhat less obvious places in your project that you can analyze in advance with just a little spike work. If you haven't built the one to throw away first, you should: a couple of weeks of effort can save you months in many ways, one of which is finding a number of modules. However, until you get deep into the real implementation there will be something you miss.

Those things you miss are the important parts. If you mocked any of them, switching will be hard. You have to change both the test and the implementation at the same time, then change all of the tests to use a different mock, and then change every place in the code that calls that mock over to the new API. When you change code and tests at the same time there is always the danger that you make a change that is wrong but the tests don't catch it. When you have many tests broken there is a long time between feedback. What if you break real functionality? You may go for hours making tests pass before you notice the one failing test that is a real bug, which puts you hours away from the code change where the problem came in.

Back to our Brunswick bowling machine. As the example played out, the authors have a game object which you can ask for CurrentFrame, and also GetScoreForFrame. This interface might need some extension, but you can quickly see that the game is at the heart of any analysis of score, and so this object is what gets passed around. The implementation behind it is hidden by the game abstraction: this looks like a good choice to me. Or does it? What if we are interested in looping across all throws? Now the game object gets a couple of new functions: GetNumberOfThrowsInFrame and GetThrowForFrame. However, this means two loops, which is unexpected complexity: we have to loop across each frame and then across the throws in the frame.

We can also imagine a different architecture where, instead of a game object as the abstraction, we use the list of frames: each frame is 3 throws and a flag to indicate which throws were actually taken. Looking at a list of frames is easy, and the code to loop across all throws is clear. This list is also cache efficient, which is useful on modern CPUs that pay a big price for cache misses. On the other hand, we are passing data around, which means if we ever want to change the data structure we need to change every place in the code that reads it - this is why data is kept private in nearly every architecture.
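Here is a rough sketch of the two shapes being compared. The method names on the game object follow the discussion above; the Frame layout and everything else are my assumptions, not actual code from the bowling game example.

#include <array>
#include <iostream>

// Option 1, sketched as an interface only: the game object hides the data.
//   class Game {
//   public:
//       int CurrentFrame() const;
//       int GetScoreForFrame(int frame) const;
//       int GetNumberOfThrowsInFrame(int frame) const;  // the later additions
//       int GetThrowForFrame(int frame, int throwInFrame) const;
//   };
// Looping across all throws takes two nested loops through that API.

// Option 2: the list of frames is the abstraction. Each frame holds up to
// three throws plus a count of how many were actually taken.
struct Frame {
    std::array<int, 3> throws{};
    int throwsTaken = 0;
};

int main() {
    std::array<Frame, 10> game{};
    game[0].throws = {10, 0, 0}; game[0].throwsTaken = 1;  // a strike
    game[1].throws = {7, 2, 0};  game[1].throwsTaken = 2;  // an open frame

    // The loop over all throws is direct and reads clearly - but it now
    // depends on the exact data layout, which is the tradeoff noted above.
    for (const Frame& frame : game) {
        for (int i = 0; i < frame.throwsTaken; ++i) {
            std::cout << frame.throws[i] << ' ';
        }
    }
    std::cout << '\n';
    return 0;
}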

Note that you are not limited to either a list of frames or a game object. There are many other ways to design the system, each with different tradeoffs. Maybe the real problem is that I created a poor API for getting throws from the game object? Or maybe the problem is that the game object is the abstraction: perhaps a different object (or set of objects) can provide the right abstraction. I leave the correct way to solve this problem up to your imagination.

In short, TDD is not a silver bullet that lets you fire all the architects. You still need their vision to get the interfaces right.

If there is any other downside to automated testing you think I've missed, please let me know.

Wednesday, November 26, 2014

What is Code Coverage worth?

I've run across a few people who say code coverage is the reason to write automated tests. When you are done you can run some tool and see that you have 92.5385% coverage (as if all those digits were statistically significant). I have always questioned this claim: on their own, the numbers are meaningless.

Nobody has actually told me why I should care about the number. Oh sure, we can all agree that 17% looks pretty bad and 95% looks pretty good, but so what? If you are at 17%, what are you going to do about it? I know a number of successful products that seem pretty stable and are much worse than 20% - in fact, nearly every program up until 2005 had 0% coverage and nobody knew to care. I also know of code with more than 90% coverage that had (has?) significant bugs.

People who use code coverage tell me that it is useful when:

You have a new team member. Looking at coverage for the code he creates you know if he is creating tests. You can have a conversation about team expectations if test coverage is not near to what the rest of the team is doing.

You think you are doing TDD - if you see less than 100% coverage it means you didn't TDD that code. These are the places to go back and delete the code until you write the failing test case that requires it.

You have a legacy system with some tests. By considering coverage and relative risk you can decide which areas to put technical focus on first. Pragmatically you know that when working with legacy code you cannot fix everything today (customers won't buy technically better, they buy new features), so you need to prioritize. You should be able to get some time/money from the business for technical work (if you can't you have other problems): coverage is an input in the decision of what to work on with that time.

I'll be honest: I've only heard of teams successfully using coverage for the above. I've never seen any of it in a project I've been on. By contrast, I have seen all of the following ways that code coverage can be bad. Code coverage is dangerous if:

You think code coverage is a sign of quality. It is very easy to write tests that cover a line of code without actually testing that the line of code works. The following trivial example gets 100% coverage, but the code is wrong and the test cannot notice.
int add(int a, int b) { return 1; }  // deliberately wrong
TEST(add) { add(2, 2); }             // executes the line, but asserts nothing
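For contrast, a version that actually checks the answer gets the identical coverage number, which is the point: the coverage tool cannot tell these two apart. (ASSERT_EQUAL here is a stand-in for whatever assertion macro your test framework provides.)

int add(int a, int b) { return a + b; }
TEST(add) { ASSERT_EQUAL(4, add(2, 2)); }  // same 100% coverage, but now it proves something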
You assign a target number. This is partially related to the above: people will sometimes write bad tests just to hit the target. Ignoring that, some code needs more testing than other code, and some code just cannot be usefully tested - multi-threaded code may have a bunch of untested mutexes, and in a compiled language getters and setters don't need testing (how could they possibly fail?). In the first case you use careful code review because nobody has figured out how to write useful tests; in the second nobody cares because the code can't possibly be wrong. On the other hand, for single-threaded business logic classes 100% code coverage shouldn't be hard, and there you should go above the target.

Your boss knows what the numbers are. This is a variation of both of the above with a twist worth noting: because the boss is looking, you should expect to see coverage numbers in your yearly reviews, and they might affect your pay. While not morally justified, it is fully understandable that people will write extra "tests" just to exceed his target coverage numbers, even if the tests are of no value.

Given that I've never seen a good use for coverage, I have to ask why we bother to measure it. I would advise you not to measure coverage, on a YAGNI basis: if you ever do decide you will use it usefully, you can measure it later.