Monday, December 15, 2014

Don't automate your unit tests

I first started doing unit tests about 10 years before I heard about the idea of automated tests. It was understood that unit tests were the programmer testing small sections of code in isolation. Who better to test that all the logic works than the programmer who wrote it? You wrote a test program to give each function the input you cared about, and inspected that the state of the system was what you expected. It worked great: I remember finding many subtle bugs in code I thought was right, until I noticed the internal state of my system was wrong. Once you were done with each test you modified your fixture to test the next input, without saving your test program. When you finished you deleted the test program.
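To make that concrete, here is a minimal sketch of what one of those throwaway test programs might have looked like (the Counter class and its fields are hypothetical stand-ins, not code from any real project):

#include <cstdio>

// Hypothetical unit under test: a tiny class with internal state.
struct Counter {
    int value = 0;
    int overflow_count = 0;
    void add(int n) {
        value += n;
        if (value > 100) { value -= 100; ++overflow_count; }
    }
};

// Throwaway test driver: feed in one input, print the internal state,
// and eyeball the output against expectations. Then edit main() for the
// next input. When finished, delete the file.
int main() {
    Counter c;
    c.add(70);
    c.add(50);
    printf("value=%d (expected 20), overflows=%d (expected 1)\n",
           c.value, c.overflow_count);
    return 0;
}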

The above was often called white box testing, as opposed to black box testing. In white box testing the person doing the testing knows how the system is implemented and takes advantage of that fact to write tests. By contrast, in black box testing the person doing the testing doesn't know the implementation; they verify the system works as it should. Having both types of manual tests was considered best practice.

If you are switching from the above to automated tests it might seem easy and obvious that you should take the tests you wrote previously and save them. It turns out that this is actually wrong. Throwing away those tests was the right thing to do. Even though they found many bugs, they are not the types of tests you want long term. As I wrote previously, the reason we automate tests is to support future changes to the code. Unit tests don't support the future of the code; they verify its current state. This subtle distinction means that those old unit tests are not useful.

The future of your code is change: new features, bug fixes, performance improvements and refactoring. Each will change the internal state of your program and thus break your unit tests. If those changes break your automated tests, the only thing to do is throw the tests out and write new ones. Even if you start with the old tests and make minor tweaks, you should still consider them brand new tests. This is not what you want. When you fix a bug you should add new tests but otherwise not change any tests. When you refactor or make a performance improvement, there should be zero tests that change. When you add a feature you might find a couple of tests that conflict with the new feature and must be deleted, but overall you should be adding tests. Unit tests meet none of these goals: because unit tests know the internal state of your system, they are the wrong answer for the future.

This does not mean unit tests are useless. I often insert printfs or debugger breakpoints into my code while trying to understand a bug. Many languages have some form of ASSERT() which will abort the program if some condition does not hold. Those are all forms of unit tests. However, the fact that I know some details as they are now doesn't mean I should test those details automatically.
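A minimal sketch of what I mean (the function is hypothetical): the printf and the assert both check details I happen to know today, and both go away once the bug is understood, without ever becoming automated tests of those details.

#include <cassert>
#include <cstdio>

int divide(int numerator, int denominator) {
    assert(denominator != 0);  // aborts if the condition does not hold
    int result = numerator / denominator;
    printf("divide(%d, %d) -> %d\n", numerator, denominator, result);  // debug print
    return result;
}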

When writing automated tests you need to take advantage of your knowledge of the system to ensure that all of the internal states are exercised. However, you must not actually check the internal state of the system to ensure it is correct. Instead you need to figure out what the external state of the system should be, given that internal state. Often this means you combine what would have been several different unit tests into one test.
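Here is a sketch of what I mean, assuming a hypothetical fixed-size ring queue: I know the internal indices wrap, so I pick inputs that force a wrap, but the test only ever asserts on external behavior.

#include <cassert>

// Hypothetical unit under test: internal head/tail indices wrap at 4.
class RingQueue {
    int data_[4];
    int head_ = 0, tail_ = 0, count_ = 0;
public:
    void push(int v) { data_[tail_] = v; tail_ = (tail_ + 1) % 4; ++count_; }
    int pop() { int v = data_[head_]; head_ = (head_ + 1) % 4; --count_; return v; }
    int size() const { return count_; }
};

// One behavioral test replacing several internal-state unit tests:
// the pushes below force tail_ to wrap, but we assert only on the
// external FIFO contract, never on head_ or tail_ themselves.
int main() {
    RingQueue q;
    q.push(1); q.push(2); q.push(3);
    assert(q.pop() == 1);
    q.push(4); q.push(5);  // tail_ wraps here; the test doesn't care
    assert(q.pop() == 2);
    assert(q.pop() == 3);
    assert(q.pop() == 4);
    assert(q.pop() == 5);
    assert(q.size() == 0);
    return 0;
}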

Is there an exception to this rule? Code to prevent a cross-thread race condition cannot be tested by external behavior, but it still needs to work. You can try running the test a million times, but even that isn't guaranteed to hit the race condition once. I hesitate to allow testing internal state even in cases like this: the locks required to make the code thread safe should not be held longer than required, and if a lock remains held long enough for your test to check that you are holding it, the lock is held too long for good performance.

Friday, December 5, 2014

Tests and Architecture

This post is a follow up to my previous post on testing implementation details.

I believe that when you support something you should take extra care to ensure that you are not blinding yourself to the downsides. Since I'm a proponent of testing software, and of TDD as the way to do that, it is good to ask: what are the downsides?

There is one obvious downside that I'm aware of: tests do not drive good architecture. Even though TDD is sometimes called Test Driven Design (Test Driven Development is also used), TDD doesn't actually drive good design! Sure, it works well for tiny programs. However, I have found that for large programs where many programmers (more than 10) are working full time, you cannot rely on TDD alone to get architecture: you need something more.

In the well known TDD bowling game example the authors talk about a list of frame objects that they expected to need in the early stages. However, as their program matured over the exercise they never created one. Many people have done the same example, with the same tests, and ended up with a list of frames. Which is right? For the requirements given, both are equally correct. However, the requirements are such that one pair could write the entire program in a couple of hours.

So let's imagine what a real bowling program would look like if we worked for Brunswick. If you haven't been bowling in a few years you should go to their website and investigate a modern bowling center. Bowling is not just about keeping score: the system needs to track which pins are still standing, and the clerk needs to turn lanes on and off when they are paid for. There is a terminal for interaction (entering bowlers' names), and a second screen that displays the score, or funky animations when a strike is bowled. There is the pin setting machine, which we can assume has a computer of its own. There are probably a number of other requirements for different events that aren't obvious from the website, but are important anyway. The need to add these additional features places requirements on the architecture that do not exist in the example. Do any of these requirements demand a list of frames?

Now many of you are thinking YAGNI; that is only partially correct. Even its most extreme proponents will tell you that YAGNI cannot work without continuous refactoring. You can look at something that is done, ask "is this right?" and re-design the architecture to fit. The entire cycle is only a few hours long: the full sequence of steps ends in refactor, and part of refactoring is creating good architecture. When you have something that works you can step back and say "now that I have experience, what is a good design?" instead of trying to guess. This is great for getting the classes right in the small, much better than up-front design. However, I contend that YAGNI alone isn't enough for a large project.

YAGNI hits a wall on large projects. At some point you will realize that someone over on the other side of the office (often the world!) did something similar to something that you have also done; both of you did it a few months back and have been using your versions all over. The right thing to do is refactor to remove the duplication, but how do you do that when the APIs are slightly different and both are used all over? Creating modules is easy if the API in question was identified correctly in advance, but if you see the need late it can be difficult. Tests do not help you create the right modules in advance. Tests can help you add those modules later when you find a need, but often they will actually hinder adding them.

So the only question is how do you identify those modules in advance? Fortunately for me, this is a blog on testing. Therefore I can leave the discussion of how to figure out where you need those modules to someone else...

Unfortunately I can't just leave you with that cop out; nobody sensible would accept it. I'm also out of my league here. There are the obvious places where a boundary exists, but those are also easy and probably already done. You would be a fool to write your own string implementation, for example (I'm assuming your project isn't to write the standard library for your language). There are somewhat less obvious places in your project that you can analyze in advance with just a little spike work. If you haven't built the one to throw away first, you should: a couple of weeks of effort can save you months in many ways, one of which is finding a number of modules. However, until you get deep into the real implementation there will be something you miss.

Those things you miss are the important parts. If you mocked any of them, switching will be hard. You have to change the test and the implementation at the same time, then change all of the tests to use a different mock, and then change every place in the code that calls the mocked API over to the new one. When you change code and tests at the same time there is always the danger that you make a change that is wrong, but the tests don't catch it. When you have many broken tests there is a long time between feedback: what if you break real functionality? You may go for hours making tests pass before you notice the one failing test that is a real bug, which means you are hours from the code change where the problem came in.

Back to our Brunswick bowling machine. As the example played out, they have a game object which you can ask CurrentFrame, and also GetScoreForFrame. This interface might need some extension, but you can quickly see that the game is the heart of any analysis of score, and so this object is what is passed around. The implementation behind it is hidden by the game abstraction: this looks like a good choice to me. Or does it? What if we are interested in looping across all throws? Now the game object gets a couple of new functions: GetNumberOfThrowsInFrame and GetThrowForFrame. However, this means two loops, which is unexpected complexity: we have to loop across each frame and then across the throws in the frame.

We can also imagine a different architecture where instead of a game object as the abstraction we use the list of frames, where each frame is 3 throws and a flag to indicate which throws were actually taken. Looping over a list of frames is easy, and the code to loop across all throws is clear. This list is also cache efficient, which is useful on modern CPUs that pay a big price for cache misses. On the other hand, we are passing data around, which means if we ever want to change the data structure we need to change every place in the code that reads it; this is why data is kept private in nearly every architecture.
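To make the trade-off concrete, here is a sketch of looping across all throws in both designs. The Game methods come from the example, but the exact signatures here are my guesses:

#include <vector>

// The game-object design: the interface is reconstructed from the
// example, with assumed signatures.
class Game {
public:
    virtual ~Game() = default;
    virtual int CurrentFrame() const = 0;
    virtual int GetScoreForFrame(int frame) const = 0;
    virtual int GetNumberOfThrowsInFrame(int frame) const = 0;
    virtual int GetThrowForFrame(int frame, int throwIndex) const = 0;
};

// Looping across all throws takes two nested loops through the API.
int sumAllThrows(const Game& game) {
    int total = 0;
    for (int frame = 1; frame <= game.CurrentFrame(); ++frame)
        for (int t = 0; t < game.GetNumberOfThrowsInFrame(frame); ++t)
            total += game.GetThrowForFrame(frame, t);
    return total;
}

// The list-of-frames design: each frame is 3 throws plus flags for
// which throws were actually taken.
struct Frame {
    int throws[3];
    bool taken[3];
};

// The same loop over the flat data is short and obvious.
int sumAllThrows(const std::vector<Frame>& frames) {
    int total = 0;
    for (const Frame& f : frames)
        for (int t = 0; t < 3; ++t)
            if (f.taken[t]) total += f.throws[t];
    return total;
}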

Note that you are not limited to either a list of frames or a game object. There are many other ways to design your system, each with different trade-offs. Maybe the real problem is I created a poor API for getting throws from the game object? Or maybe the problem is that the game object is the abstraction: perhaps a different object (or set of objects) can provide the right abstraction. I leave the correct way to solve this problem up to your imagination.

In short, TDD is not a silver bullet that lets you fire all the architects. You still need their vision to get the interfaces right.

If there is any other downside to automated testing you think I've missed, please let me know.

Wednesday, November 26, 2014

What is Code Coverage worth?

I've run across a few people who say code coverage is the reason to write automated tests. When you are done you can run some tool and see that you have 92.5385% coverage (as if all those digits were statistically significant). I have always questioned this claim: the numbers by themselves are meaningless.

Nobody has actually told me why I should care about the number. Oh sure, we can all agree that 17% looks pretty bad, and 95% looks pretty good, but so what? If you are at 17%, what are you going to do about it? I know a number of successful products that seem pretty stable that are much worse than 20%; in fact nearly every program written before 2005 had 0% coverage, and nobody knew to care. I also know of code with more than 90% coverage that had (has?) significant bugs.

People who use code coverage tell me that it is useful when:

You have a new team member. Looking at coverage for the code he creates, you know whether he is creating tests. You can have a conversation about team expectations if his test coverage is not near what the rest of the team is achieving.

You think you are doing TDD - if you see less than 100% coverage it means you didn't TDD that code. These are the places to go back and delete the code until you write the failing test case that requires it.

You have a legacy system with some tests. By considering coverage and relative risk you can decide which areas to put technical focus on first. Pragmatically, you know that when working with legacy code you cannot fix everything today (customers won't buy technically better, they buy new features), so you need to prioritize. You should be able to get some time/money from the business for technical work (if you can't, you have other problems): coverage is one input to the decision of what to work on with that time.

I'll be honest, I've only heard of teams successfully using coverage for the above. I've never seen any of that on a project I've been part of. By contrast, I have seen all of the following ways that code coverage can be bad. Code coverage is dangerous if:

You think code coverage is a sign of quality. It is very easy to write tests that cover a line of code without actually testing that the line of code works. The following trivial example gets 100% coverage, but the code is wrong.
int add(int a, int b) { return 1; }  /* wrong: ignores both arguments */
TEST(add) { add(2, 2); }             /* covers every line, asserts nothing */
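Compare with a version that actually asserts on the result (ASSERT_EQUAL here stands in for whatever assertion your test framework provides). It has the same 100% coverage, but now the coverage means something:

int add(int a, int b) { return a + b; }
TEST(add) { ASSERT_EQUAL(4, add(2, 2)); }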
You assign a target number. This is partially related to the above: people will sometimes write bad tests just to hit the target. Even ignoring that, some code needs more testing than other code. Some code just cannot be usefully tested - multi-threaded code may have a bunch of untested mutexes. In a compiled language, getters and setters don't need testing - how can they possibly fail? In the first case you use careful code review because nobody has figured out how to write useful tests; in the second nobody cares because the code can't possibly be wrong. On the other hand, for single-threaded business logic classes 100% code coverage shouldn't be hard, and so you should go above the target.

Your boss knows what the numbers are. This is a variation of both of the above, with a twist worth noting: because the boss is looking, you should expect to see coverage numbers in your yearly reviews, and they might affect your pay. While not morally justified, it is fully understandable why you would write extra "tests" just to ensure you are exceeding his target coverage numbers, even if the tests are of no value.

Given that I've never seen a good use for coverage, I have to ask why we bother to measure it. I would advise you not to measure coverage on a YAGNI basis: if you ever do decide you can use it usefully, you can measure it later.

Thursday, November 20, 2014

Don't test implementation details.

Those who have the misfortune to have me on their code reviews see me repeat this often. Don't test implementation details.

What do I mean? In short, don't test something you are likely to refactor later. Find a test point that isn't likely to change. All too often I see someone write a seemingly useful test, but it locks them into a design that shouldn't be locked in.

Mocks are the primary way I see this rule violated. Your code calls a function with some arguments, so you have your mock check for those. However there are often several different ways to use the function that in the end give the same result. Your tests should support changing implementations to see what is better.

For example, the standard file read function has a length argument. If you mock this, your mock needs to be powerful enough to handle being called once with a buffer size of 1000 bytes, or twice with a 500 byte buffer. Both are equally valid, and depending on how you use the data, and on performance considerations, one might be better. As hardware changes, which is better may change. With good tests you can try many different read sizes to see if any makes a difference in the real world.
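Here is a sketch of what I mean, with a hypothetical Reader interface. A fake that serves bytes from an in-memory string works no matter how the implementation chunks its reads; a mock that expects exactly one read of 1000 bytes pins the implementation to that call pattern.

#include <algorithm>
#include <cstring>
#include <string>

// Hypothetical interface the code under test depends on.
class Reader {
public:
    virtual ~Reader() = default;
    virtual int read(char* buffer, int length) = 0;  // returns bytes read
};

// A fake serving bytes from memory. One 1000 byte read or two 500 byte
// reads both work; the fake does not care about the call pattern.
class FakeReader : public Reader {
    std::string data_;
    size_t pos_ = 0;
public:
    explicit FakeReader(std::string data) : data_(std::move(data)) {}
    int read(char* buffer, int length) override {
        int n = static_cast<int>(
            std::min<size_t>(length, data_.size() - pos_));
        std::memcpy(buffer, data_.data() + pos_, n);
        pos_ += n;
        return n;
    }
};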

Read the above again, but now with this in mind: computers have been doing file IO for many years. We understand it well. It is very unlikely someone will come up with a new way to do file IO that is completely different from what your program supports today, and compellingly enough better that you would change your program to use it.

Contrast that with your typical program. The internal API is very much in flux because nobody knows what the right way to solve the problem is. It seems to be taken as a matter of faith that you should break your tests up, but where do you break things up? Once you break something up you won't touch that API again. Not that you can't touch it, just that anyone who does will be afraid of all the tests that break, and give up instead.

One role of architecture is to figure this out, but the architects only get part of the work done. Anytime architecture gives you a well defined interface, that is a great place to jump in with test doubles: at least you know that it is intended to be hard to change that API. Thus your tests are a little less fragile: you won't be making changes anyway. However, often we want more granular isolation, and then test doubles stand in the way of change.

Figuring out the answer to this question is in fact one of the reasons I started this blog. I don't know what is right, but I'm starting to recognize a problem. Expect to see this topic explored in greater detail over time.

Note, don't take my comments on file IO as a statement that I believe you should mock file IO. There is actually a lot of debate on this. There are some people who say you should never mock anything where you don't own both sides of the interface. They will tell you never to mock file IO.

Tuesday, November 11, 2014

Testing timers

In the testing world very little is said about how to test timers. The little advice given seems to be: isolate the timers, and code review those isolated areas instead of testing them. That advice is okay for a lot of applications: waiting is generally a bad thing. Nobody finishes preparing a web page and then puts in a delay before sending it. As such, timers are generally only in a couple of (not very critical) places, and not a core part of any business logic.

What if you don't have that luxury? What if time is what your business logic is about? My team works on an embedded system. A large part of our business logic relates to timers, for example sending keep alive messages every 200ms, unless some other message has been sent. We have to deal with a number of different hardware devices with different rules, so getting all the logic correct is a significant source of potential bugs. To make it harder, we need certification from a third party on each release: mess up one of the rules and we can't sell anything. To just rely on code review isn't enough; we need tests that use timers.

A few years ago one of my teammates injected a timer into his class. It seemed like a good idea: now that it was injected he could inject a mock and test his time-using class without waiting for real time to pass. Since his requirement was a 20 second timeout, and there were several border cases to test, this brought his test time down from over a minute to a few milliseconds. As a bonus, his tests could run on a heavily loaded machine (i.e. the CI system) without random failures. What's not to love?

For several years we continued that way. It was great: our tests were fast and deterministic, despite testing time, which is notorious for leading to slow, fragile tests. Somehow it always felt a little icky to me though. It took me several years before I figured out why, and then only when reading someone else's code review, out of the context of what his program did. Timers don't feel like a dependency; they feel like an implementation detail. As such, injecting timers leads to the wrong architecture.

Here is my surprising conclusion: timers are shared global state and should be used that way!

That still leaves open the problem of how to test timers, and those are real enough problems. I solved it by embracing global state. In C the linker will take the first function of a given name it finds. We (my pair and I) took advantage of that fact to write a fake timer, with all the timer functions, but a different implementation. You set a timer callback and it goes into a list, but no real timer is created. Then we wrote an advance-time function that goes through the list and triggers, in order, all the timers that expired. Last, we told the linker to use our new timer instead of the real thing in our tests.
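A simplified sketch of the idea (the names and details here are made up; the real version replaced the C timer functions at link time):

#include <algorithm>
#include <vector>

typedef void (*timer_callback)(void);

struct PendingTimer {
    long long fire_at_ms;   // absolute simulated time
    timer_callback callback;
};

static long long g_now_ms = 0;              // the simulated clock
static std::vector<PendingTimer> g_timers;  // pending one-shot timers

// Same signature as the real timer function; just records the callback.
void start_timer(long long delay_ms, timer_callback cb) {
    g_timers.push_back({g_now_ms + delay_ms, cb});
}

// Test-only: advance simulated time, firing expired timers in order.
// Timers started from inside a callback are picked up on the next pass,
// so a timer chain plays out at the correct simulated times.
void AdvanceTime(long long delta_ms) {
    const long long end_ms = g_now_ms + delta_ms;
    for (;;) {
        auto next = std::min_element(
            g_timers.begin(), g_timers.end(),
            [](const PendingTimer& a, const PendingTimer& b) {
                return a.fire_at_ms < b.fire_at_ms;
            });
        if (next == g_timers.end() || next->fire_at_ms > end_ms) break;
        g_now_ms = next->fire_at_ms;  // jump the clock to the expiry
        timer_callback cb = next->callback;
        g_timers.erase(next);         // one-shot: remove before firing
        cb();                         // may start new timers
    }
    g_now_ms = end_ms;
}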

Testing with the fake timer feels more natural than with injected test doubles: when using a double you tell each timer to fire, but you have to actually know how the implementation uses the timer. The fake has just one interface: AdvanceTime(). Advance the time 10 seconds, and the right things happen. A 1 second timer will fire 10 times, while a 7 second timer fires once, between the 1 second timer firing 7 and 8 times (or maybe 6 and 7 times, depending on ordering). If the 7 second timer starts a 2 second timer, then that new timer will fire at the simulated 9 seconds. The total time for this entire test: a couple of milliseconds!
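Using the sketch above, a test of the (hypothetical) 200ms keep-alive logic mentioned earlier might look like this:

#include <cassert>

static int g_keepalives_sent = 0;

void send_keepalive(void) {
    ++g_keepalives_sent;
    start_timer(200, send_keepalive);  // re-arm for the next interval
}

void test_keepalive_timing(void) {
    g_keepalives_sent = 0;
    start_timer(200, send_keepalive);
    AdvanceTime(1000);                 // one simulated second
    assert(g_keepalives_sent == 5);    // fired at 200, 400, 600, 800, 1000
}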

One additional bonus: our timer simulates time well. If you have a 200ms timer, advance time 50ms, and then restart the timer, it will handle that correctly. If you cancel a timer, it is canceled. Those testing with a mock timer had a lot of work to do to handle those cases, while it is easy with the fake. When the implementation changed the way the timer was used, a bunch of unrelated mock-based tests broke, while the fake just handled the refactoring. Our timer supports an isRunning function; mocks typically did not model this state (and those that did often got it wrong somehow), while the fake handles all the special cases to ensure that when you ask whether your timer is running you get the right result.

It seems cool, even several months later. Maybe it is? I'm a little scared that as the inventor I'm missing something. I offer it to you though as something to try.

Tuesday, October 28, 2014

Why do we automate our tests?

Sounds like a simple question, but I see many people who don't actually understand why we automate our tests.
Before TDD I used to write a lot of one-off programs to test my code. While they were not exhaustive, they were pretty good: my code had few defects. Even with TDD my code is not perfect. I'm not sure what part of my better code today is because of TDD and what part is because, as an older programmer, I have learned to write better code. So does automation gain anything?

Some people have claimed that 90% of automated tests never fail. I don't know if this number is correct, but my personal experience suggests it isn't too far off. So why not throw them away and forget about automation?

Some people claim that automation is done so you can measure coverage of your tests. Someday I will write about this in more detail. For now, if test coverage numbers are your goal, you can get those numbers without automation. So again automation hasn't proven itself. (I will admit here that getting coverage without automation probably requires a lot more effort than automation, but the point is you don't need automation).

Why automate? I'll go back to the earlier claim that 90% of tests never fail. That leaves 10% of tests that at some point in the future have your back. That is, someday in the future you will make a change thinking all is well, not realizing some obscure side effect has broken some other, seemingly unrelated feature. Fortunately some test will save you by informing you of that side effect.

If only we knew which 10%, we could throw the rest away and get all the value. However, we don't know which tests those are. Instead we keep them all, maintain them all, and wait for them to run each build. All for the times we aren't perfect.

Those 10% are the real value of automated tests. My one-off programs worked great, but they were gone moments after they passed. The next day, if I had to make a change, I had to re-create them. Generally I created the ones I thought I needed, which was not the complete set, and I had no way of being sure that everything I skipped actually didn't matter. Thus the value of automated tests: all your tests are there and working for you when you need them.

So in conclusion, the reason to automate tests is your next code change; it has nothing to do with the current day's effort. If your code has no future, then automated tests are not useful.

Wednesday, October 22, 2014

Now that we have defined the test doubles, what next?

I previously defined the four different test doubles, but definitions are often not much help in figuring out what to use in your current situation. So let me look at the differences in more detail.

The first thing to notice is that stubs and fakes are about what a particular function call returns, while mocks and spies are about what a function was called with. This fundamental difference is a minor point in real use, though: how can your stub or fake know what to return without looking at the arguments it is given anyway? Likewise, your mocks need to return something when they are called. (I have never seen a spy used where it has to return something, but I suppose someone might come up with one.)

There is a continuum between stub and fake. I don't know any hard rules to say where the boundary is, but in practice you generally know. A stub can generally only return a couple of different canned values from a function, and that is it. A fake is more complex: it tracks state and previously set values to create the proper return values, while still taking some shortcuts.

A fake is generally complex enough that you have tests just for the fake. Either you know the behavior you want to model and directly test drive it, or you add tests after the fact because changes to the fake keep causing hundreds of tests to fail in non-obvious ways. In the latter case the tests are added out of defensive programming: you made the mistake, so you write tests to help the next person not make it.

A fake is also something you could ship to some customers - as a demo of the full product. Since a fake is a shortcut to something hard to get access to, the fake makes a great training simulator. I even know of one case where a fake of the database is what was shipped when they realized the fake actually did everything they needed without the overhead of a SQL database.

If you are still confused, don't worry. The difference between stub and fake isn't important. It wouldn't be hard to make an argument that there is no difference. I like a separation myself, but I'm not sure if I'm being silly.

Mock frameworks have an interesting side effect: once you have gone through the effort of creating the mock once, you can quickly create many stubs just by changing the return values of only the functions that are called. This is often the way mock frameworks are used.
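For example, with Google Mock (a sketch; the MockConfigStore interface is hypothetical), the mock class is written once, and each test turns it into a different stub with a one-line ON_CALL:

#include <gmock/gmock.h>
#include <gtest/gtest.h>

using ::testing::NiceMock;
using ::testing::Return;

class MockConfigStore {
public:
    MOCK_METHOD(int, GetTimeout, (), (const));
    MOCK_METHOD(bool, Save, (int timeout));
};

TEST(Example, MockUsedAsStub) {
    NiceMock<MockConfigStore> store;
    // No expectations about how or when it is called; we only pick the
    // return value for the one function this test cares about.
    ON_CALL(store, GetTimeout()).WillByDefault(Return(30));
    EXPECT_EQ(30, store.GetTimeout());
}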

Spies are most often used with an observer pattern, since there is no return value to worry about. This is too bad, because mocks break the arrange-act-assert pattern of good tests: with a mock you have to assert before you act, while a spy lets you assert after.

Monday, October 20, 2014

Defining the Test doubles

There is a lot of confusion about test doubles; many programmers are not even aware of all the options. This is a shame because each option has value.

A stub is an implementation that returns canned values. This implies that some of the results may make no sense in the real world, but for the purposes of the test they are good enough.

A fake is a full implementation, but one that doesn't do everything the real implementation does. It might use an in-memory database, or it might pretend to be a serial port with a device attached without actually using a real serial port.

A mock is a way to ensure functions are called in specific ways. Call a function with the wrong argument and the test fails.

A spy is a way to check, after the fact, what functions were called.
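To make the definitions concrete, here is a minimal sketch of three of the four doubles for one hypothetical dependency (a mock is usually generated by a framework rather than written by hand, so I describe it in a comment):

#include <string>
#include <vector>

// One hypothetical dependency, several doubles for it.
class MessagePort {
public:
    virtual ~MessagePort() = default;
    virtual bool send(const std::string& msg) = 0;
};

// Stub: canned answers, no logic.
class StubPort : public MessagePort {
public:
    bool send(const std::string&) override { return true; }
};

// Fake: a working implementation with a shortcut. Messages go into a
// vector instead of a real serial port, but the port's rules still apply.
class FakePort : public MessagePort {
public:
    std::vector<std::string> delivered;
    bool send(const std::string& msg) override {
        if (msg.empty()) return false;  // models a rule of the real port
        delivered.push_back(msg);
        return true;
    }
};

// Spy: records what happened so the test can check it after the fact.
class SpyPort : public MessagePort {
public:
    std::vector<std::string> calls;
    bool send(const std::string& msg) override {
        calls.push_back(msg);
        return true;
    }
};

// A mock would go one step further: it is told up front exactly what
// send() must be called with, and fails the test on any mismatch.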

I hope that the above are obvious, but still it is good to have everything defined so we can have a conversation about test doubles.

Of course the next step is defining when to use each.  That is a large topic that I hope to explore over time.

Tuesday, October 7, 2014

Should tests drive code?

On a code review recently someone asked a question about one test: what change did this test drive? The implication was that the test in question didn't do anything to drive the change under review. Is this even a valid question to ask?

I do not know what motivated the question, but let me speculate. Test Driven Development is, for good reasons, all the rage these days. TDD is all about writing tests to drive code, which is a great way to get started.

What few people realize is that TDD misses "works by design". There are many points in the TDD cycle where the simplest thing that can possibly work has a side effect that also causes some other useful behavior that users will rely on. These side effects need to be tested, even though the test will go green instantly. Otherwise some future person may refactor and lose that useful side effect without any tests breaking.

Sounds easy when I put it that way. In the real world it isn't. You may not even realize, when you make the decision, what the side effect is. Even if you know, there may be a lot of other code required before the test that would require the behavior can be written. In the former case you obviously won't write the test, while in the latter it tends to get shoved to the bottom of your ever-expanding todo list and forgotten.

Thus there is always a need, when working on code, to ask the question: "are the existing tests in this area actually sufficient?" If the answer is no, then you need to write the missing tests.

But wait, what if my premise is wrong? There is also the possibility that the question is rooted in the statement that while the test itself is good, it is outside the scope of the current story and review. This is actually a valid point; I'm all for breaking work up. One advantage of great version control systems shows up when you are in the code and see an opportunity for more tests: when you need to write a test that isn't related to the current work, just check the current work in, update your source, make the change, check it in, and go back to your main work.