Monday, December 15, 2014

Don't automate your unit tests

I first started doing unit tests about 10 years before I heard about the idea of automated tests. It was understood that unit tests were the programmer testing small sections of code in isolation. Who better to verify that all the logic works than the programmer who wrote it? You wrote a test program to give each function the input you cared about, and inspected whether the state of the system was what you expected. It worked great: I remember finding many subtle bugs in code I thought was right until I noticed the internal state of my system was wrong. Once you were done with each case you modified your fixture to test the next input, without saving the test program. When you finished you deleted the test program.
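To make that workflow concrete, here is a minimal sketch of the kind of throwaway test program I mean. The function and the internal state it pokes at are made-up stand-ins, not code from any real project:

    // Throwaway test program: feed the function one input, eyeball the state.
    #include <cstdio>

    static int g_item_count = 0;               // internal state we peek at

    void update_inventory(int delta) {         // hypothetical function under test
        g_item_count += delta;
    }

    int main() {
        update_inventory(5);                   // the input we care about right now
        std::printf("item count = %d\n", g_item_count);  // inspect internal state by hand
        // Edit the input, rebuild, and re-run for the next case;
        // delete this file when the testing session is over.
        return 0;
    }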

The above was often called white box testing, as opposed to black box testing. In white box testing the person doing the testing knows how the system is implemented and takes advantage of that fact to write tests. By contrast, in black box testing the person doing the testing doesn't know the implementation; they verify only that the system behaves as it should. Having both types of manual tests was considered best practice.

If you are switching from the above workflow to automated tests it might seem easy and obvious to take the tests you wrote previously and save them. It turns out that this is wrong. Throwing away those tests was the right thing to do. Even though they found many bugs, they are not the type of test you want long term. As I wrote previously, the reason we automate tests is to support future changes to the code. Unit tests don't support the future of the code; they verify its current state. This subtle distinction means that those old unit tests are not useful.

The future of your code is change: new features, bug fixes, performance improvements and refactoring. Each will change the internal state of your program and thus break your unit tests. If those changes break your automated tests, the only thing to do is throw the tests out and write new ones. Even if you start with the old tests and make minor tweaks, you should still consider them brand new tests. This is not what you want. When you fix a bug you should add new tests but otherwise not change any. When you refactor or make a performance improvement, zero tests should change. When you add a feature you might find a couple of tests that conflict with the new feature and must be deleted, but overall you should be adding tests. Unit tests do none of the above: because they know the internal state of your system, they are the wrong answer for the future.

This does not mean unit tests are useless. I often insert printfs or debugger breakpoints into my code while trying to understand a bug. Many languages have some form of ASSERT() which will abort the program if some condition does not hold. Those are all forms of unit tests. However, the fact that I know some details as they are now doesn't mean I should test those details automatically.
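For example, an ASSERT can act as a tiny unit test that lives inside the code rather than in a saved test file. The function and its invariants here are made up purely for illustration:

    #include <cassert>
    #include <algorithm>
    #include <vector>

    // Returns the median of a non-empty, already-sorted vector.
    int median_of_sorted(const std::vector<int>& v) {
        assert(!v.empty() && "caller must pass a non-empty vector");
        assert(std::is_sorted(v.begin(), v.end()) && "caller must pass sorted data");
        return v[v.size() / 2];
    }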

When writing automated tests you need to take advantage of your knowledge of the system to ensure that all of the internal states are exercised. However, you must not actually check the internal state of the system to ensure it is correct. Instead you need to figure out what the external behavior of the system should be for each of those internal states, and check that. Often this means you combine what would have been several different unit tests into one test.
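As a sketch of the difference, assuming a simple cache class I've invented for illustration: a unit test would be tempted to look inside the internal map, while the automated test only checks what a caller of the class can observe.

    #include <cassert>
    #include <map>
    #include <optional>
    #include <string>

    class Cache {
    public:
        void put(const std::string& key, int value) { data_[key] = value; }
        std::optional<int> get(const std::string& key) const {
            auto it = data_.find(key);
            if (it == data_.end()) return std::nullopt;
            return it->second;
        }
    private:
        std::map<std::string, int> data_;   // the internal detail we are tempted to inspect
    };

    int main() {
        Cache c;
        c.put("answer", 42);
        // One test covering the external contract; we never look inside data_.
        assert(c.get("answer").has_value() && *c.get("answer") == 42);
        assert(!c.get("missing").has_value());
        return 0;
    }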

Is there an exception to this rule? Code that prevents a cross-thread race condition cannot be tested by external behavior, but it still needs to work. You can try running the test a million times, but even that isn't guaranteed to hit the race condition once. I hesitate to allow testing internal state even in cases like this: the locks required to make the code thread safe should not be held longer than required, and if a lock is held long enough for your test to observe that you are holding it, it is held too long for good performance.

Friday, December 5, 2014

Tests and Architecture

This post is a follow-up to my previous post on testing implementation details.

I believe that when you advocate for something you should take extra care to ensure that you are not blinding yourself to the downsides. Since I'm a proponent of testing software, and of TDD as the way to do it, it is good to ask: what are the downsides?

There is one obvious downside that I'm aware of: tests do not drive good architecture. Even though TDD is sometimes called Test Driven Design (Test Driven Development is also used), TDD doesn't actually drive good design! Sure, it works well for tiny programs. However I have found that for large programs, where many programmers (more than 10) are working full time, you cannot rely on TDD alone to get good architecture: you need something more.

In the well known TDD bowling game example the authors talk about a list of frame objects that they expected to need in the early stages. However, as their program matured over the exercise they never created one. Many people have done the same example, with the same tests, and ended up with a list of frames. Which is right? For the requirements given, both are equally correct. However, the requirements are such that one pair could write the entire program in a couple of hours.

So let's imagine what a real bowling program would look like if we worked for Brunswick. If you haven't been bowling in a few years you should go to their website and investigate a modern bowling center. Bowling is not just about keeping score: the system needs to track which pins are still standing, and the clerk needs to turn lanes on and off when they are paid for. There is a terminal for interaction (entering bowlers' names), and a second screen that displays the score, or funky animations when a strike is bowled. There is the pin-setting machine, which we can assume has a computer of its own. There are probably a number of other requirements for different events that aren't obvious from the website, but are important anyway. The need to add these additional features places requirements on the architecture that do not exist in the example. Do any of these requirements demand a list of frames?

Now many of you are thinking YAGNI; that is only partially correct. Even the most extreme proponents will tell you that it cannot work without continuous refactoring. You can look at something that is done, ask "is this right?", and re-design the architecture to fit. The entire cycle is only a few hours long: the full sequence of steps ends in refactor, and part of refactoring is creating good architecture. When you have something that works you can step back and say "now that I have experience, what is a good design?" instead of trying to guess. This is great for getting the classes right in the small, much better than up-front design. However, I contend that YAGNI alone isn't enough for a large project.

YAGNI hits a wall on large projects. At some point you will realize that someone on the other side of the office (often the other side of the world!) did something similar to something you have also done; and both of you did it a few months back and have been using your versions all over. The right thing to do is refactor to remove the duplication, but how do you do that when the APIs are slightly different and both are used everywhere? Creating modules is easy if the API in question was identified correctly in advance, but if you see the need late it can be difficult. Tests do not help you create the right modules in advance. Tests can help you add those modules later when you find a need, but they will often actually hinder adding them.

So the real question is: how do you identify those modules in advance? Fortunately for me, this is a blog on testing. Therefore I can leave the discussion of how to figure out where you need those modules to someone else...

Unfortunately I can't just leave you with that cop-out; nobody sensible would accept it. I'm also out of my league. There are the obvious places where the boundary exists, but those are also easy and probably already done. You would be a fool to write your own string implementation, for example (I'm assuming your project isn't writing the standard library for your language). There are somewhat less obvious places in your project that you can analyze in advance with just a little spike work. If you haven't built the one to throw away first, you should: a couple of weeks of effort can save you months in many ways, one of which is finding a number of modules. However, until you get deep into the real implementation there will be something you miss.

Those things you miss are the important parts. If you mocked any of them, switching will be hard. You have to change the test and the implementation at the same time, then update all of the tests to use a different mock, and then update every place in the code that calls the mocked API to use the new one. When you change code and tests at the same time there is always the danger that you make a change that is wrong, but the tests don't catch it. When many tests are broken there is a long time between feedback: what if you break real functionality? You may go for hours making tests pass before you notice the one failing test that is a real bug, which means you are hours from the code change where the problem crept in.

Back to our Brunswick bowling machine. As the example played out, they have a game object which you can ask for CurrentFrame, and also GetScoreForFrame. This interface might need some extension, but you can quickly see that the game is the heart of any analysis of the score, and so this object is what gets passed around. The implementation behind it is hidden by the game abstraction: this looks like a good choice to me. Or does it? What if we are interested in looping across all throws? Now the game object gets a couple of new functions: GetNumberOfThrowsInFrame and GetThrowForFrame. However this means two loops, which is unexpected complexity: we have to loop across each frame and then across the throws in the frame.
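My guess at what that interface looks like, and the double loop it forces, is sketched below; the exact signatures are my assumptions, not taken verbatim from the bowling game example:

    // Assumed game-object interface.
    class Game {
    public:
        virtual ~Game() = default;
        virtual int CurrentFrame() const = 0;
        virtual int GetScoreForFrame(int frame) const = 0;
        virtual int GetNumberOfThrowsInFrame(int frame) const = 0;
        virtual int GetThrowForFrame(int frame, int throwInFrame) const = 0;
    };

    // Looping across all throws requires the two nested loops described above.
    int TotalPins(const Game& game) {
        int pins = 0;
        for (int frame = 1; frame <= game.CurrentFrame(); ++frame) {
            for (int t = 0; t < game.GetNumberOfThrowsInFrame(frame); ++t) {
                pins += game.GetThrowForFrame(frame, t);
            }
        }
        return pins;
    }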

We can also imagine a different architecture where, instead of a game object as the abstraction, we use the list of frames: each frame is 3 throws plus a flag to indicate which throws were actually taken. Looping over a list of frames is easy, and the code to loop across all throws is clear. This list is also cache efficient, which is useful on modern CPUs that pay a big price for cache misses. On the other hand we are passing data around, which means if we ever want to change the data structure we need to change every place in the code that reads it; this is why data is kept private in nearly every architecture.
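A sketch of that alternative, with names and layout that are my own assumptions:

    #include <array>
    #include <vector>

    struct Frame {
        std::array<int, 3>  pins{};    // pins knocked down on each throw
        std::array<bool, 3> taken{};   // which of the 3 throws actually happened
    };

    // With the data exposed, looping over every throw is one pass over a
    // contiguous, cache-friendly structure.
    int TotalPins(const std::vector<Frame>& frames) {
        int pins = 0;
        for (const Frame& f : frames) {
            for (int t = 0; t < 3; ++t) {
                if (f.taken[t]) pins += f.pins[t];
            }
        }
        return pins;
    }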

Note that you are not limited to either a list of frames or a game object. There are many other ways to design your system, each with different trade-offs. Maybe the real problem is that I created a poor API for getting throws from the game object? Or maybe the problem is that the game object is the abstraction: perhaps a different object (or set of objects) would provide the right abstraction. I leave the correct way to solve this problem up to your imagination.

In short, TDD is not a silver bullet that lets you fire all the architects. You still need their vision to get the interfaces right.

If there is any other downside to automated testing you think I've missed, please let me know.