Wednesday, November 26, 2014

What is Code Coverage worth?

I've run across a few people who say code coverage is the reason to write automated tests. When you are done you can run some tool and see that you have 92.5385% coverage (as if all those digits were statistically significant). I have always questioned this claim: the number by itself is meaningless.

Nobody has actually told me why I should care about the number. Oh sure, we can all agree that 17% looks pretty bad and 95% looks pretty good, but so what? If you are at 17%, what are you going to do about it? I know a number of successful products that seem pretty stable and are well below 20% - in fact nearly every program written before 2005 had 0% coverage and nobody knew to care. I also know of code with > 90% coverage that had (has?) significant bugs.

People who use code coverage tell me that it is useful when:

You have a new team member. Looking at coverage for the code he creates, you know whether he is writing tests. You can have a conversation about team expectations if his coverage is not close to what the rest of the team is doing.

You think you are doing TDD - if you see less than 100% coverage, it means you didn't TDD that code. Those are the places to go back and delete code until you can write the failing test case that requires it.

You have a legacy system with some tests. By considering coverage and relative risk you can decide which areas to put technical focus on first. Pragmatically you know that when working with legacy code you cannot fix everything today (customers won't buy technically better, they buy new features), so you need to prioritize. You should be able to get some time/money from the business for technical work (if you can't you have other problems): coverage is an input in the decision of what to work on with that time.

I'll be honest: I've only heard of teams successfully using coverage for the above. I've never seen any of it on a project I've been on. By contrast, I have seen all of the following ways that code coverage can be harmful. Code coverage is dangerous if:

You think code coverage is a sign of quality. It is very easy to write tests that cover a line of code without actually testing that the line works. The following trivial example gets 100% coverage, but the code is wrong and the test never notices.
int add(int a, int b) { return 1; }   /* obviously wrong */
TEST(add) { add(2,2); }               /* covers the line, asserts nothing */
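For contrast, the same test with an assertion would fail against that broken implementation, so the coverage would actually mean something (this assumes your framework offers a CHECK_EQUAL-style assertion alongside the TEST macro - adjust to whatever yours provides):

TEST(add) { CHECK_EQUAL(4, add(2,2)); }   /* fails: the broken add returns 1 */
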
You assign a target number. This is partly related to the above: people will sometimes write bad tests just to hit the target. Even ignoring that, some code needs more testing than other code. Some code just cannot be usefully tested - multi-threaded code may have a bunch of untested mutexes. In a compiled language getters and setters don't need testing - how could they possibly fail? In the first case you rely on careful code review because nobody has figured out how to write useful tests; in the second nobody cares because the code can't possibly be wrong. On the other hand, for single-threaded business logic classes 100% coverage shouldn't be hard, so there you should go well above the target.

Your boss knows what the numbers are. This is a variation of both of the above, with a twist worth noting: because the boss is looking, you should expect to see coverage numbers in your yearly review, and they might affect your pay. While not morally justified, it is entirely understandable that you would write extra "tests" just to make sure you exceed the target coverage number, even if those tests have no value.

Given that I've never personally seen a good use for coverage, I have to ask why we bother to measure it. I would advise you not to measure coverage, on a YAGNI basis: if you ever decide you have a real use for it, you can start measuring it later.

Thursday, November 20, 2014

Don't test implementation details.

Those who have the misfortune to have me on their code reviews see me repeat this often. Don't test implementation details.

What do I mean? In short, don't test something you are likely to refactor later. Find a test point that isn't likely to change. All too often I see someone write a seemingly useful test, but it locks them into a design that shouldn't be locked in.

Mocks are the primary way I see this rule violated. Your code calls a function with some arguments, so you have your mock check for those. However, there are often several different ways to call the function that in the end give the same result. Your tests should support changing implementations so you can see which is better.

For example, the standard file read function has a length argument. If you mock it, your mock needs to be powerful enough to handle being called once with a 1000 byte buffer or twice with a 500 byte buffer. Both are equally valid, and depending on how you use the data and on performance considerations, one might be better. As hardware changes, which one is better may change. With good tests you can try many different read sizes to see if any makes a difference in the real world.
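
As a concrete sketch (the function names are hypothetical, using plain POSIX read), here are two implementations a caller cannot tell apart, but that a strict mock would treat very differently:

#include <unistd.h>

/* Read up to 1000 bytes in one call. */
ssize_t load_block(int fd, char *out)
{
    return read(fd, out, 1000);
}

/* Read the same data in two 500 byte calls (short reads ignored for brevity). */
ssize_t load_block_chunked(int fd, char *out)
{
    ssize_t n = read(fd, out, 500);
    if (n < 0)
        return n;
    ssize_t m = read(fd, out + n, 500);
    return (m < 0) ? m : n + m;
}

A mock that insists read was called exactly once with a length of 1000 passes the first version and fails the second, even though the caller gets the same bytes either way. A test written against the returned data survives both.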

Read the above again, but now with this in mind: computers have been doing file IO for many years. We understand it well. It is very unlikely someone will come up with a new way to do file IO that is completely different from what your program supports today, and so compellingly better that you would change your program to use it.

Contrast that with your typical program. The internal API is very much in flux because nobody knows yet what the right way to solve the problem is. It seems to be taken as a matter of faith that you should break your tests up, but where do you break things up? Once you break something up, you won't touch that API again. Not that you can't touch it, just that anyone who does will be afraid of all the tests that break and give up instead.

One role of architecture is to figure this out, but architects only get to a part of the work. Any time architecture gives you a well-defined interface, that is a great place to jump in with test doubles: at least you know the API is intended to be hard to change, so your tests are a little less fragile - you won't be making changes there anyway. However, we often want more granular isolation, and then test doubles stand in the way of change.

Figuring out the answer to this question is in fact one of the reasons I started this blog. I don't know what is right, but I'm starting to recognize a problem. Expect to see this topic explored in greater detail over time.

Note, don't take my comments on file IO as a statement that I believe you should mock file IO. There is actually a lot of debate on this. Some people say you should never mock anything you don't own both sides of; they will tell you never to mock file IO.

Tuesday, November 11, 2014

Testing timers

In the testing world very little is said about how to test timers. The little advice given seems to be: isolate them, and code review those isolated areas instead of testing them. That advice is okay for a lot of applications, where waiting is generally a bad thing. Nobody finishes preparing a web page and then puts in a delay before sending it. As such, timers usually show up in only a couple of (not very critical) places and are not a core part of any business logic.

What if you don't have that luxury - what if time is what your business logic is about? My team works on an embedded system. A large part of our business logic relates to timers: for example, sending a keep-alive message every 200ms unless some other message has been sent. We have to deal with a number of different hardware devices with different rules, so getting all the logic correct is a significant source of potential bugs. To make it harder, we need certification from a third party on each release; mess up one of the rules and we can't sell anything. Just relying on code review isn't enough: we need tests that exercise timers.

A few years ago one of my teammates injected a timer into his class. It seemed like a good idea: now that the timer is injected, he can substitute a mock and test his time-using class without waiting for real time to pass. Since his requirement was a 20 second timeout, and there were several border cases to test, this brought his test time down from over a minute to a few milliseconds. As a bonus, his tests could run on a heavily loaded machine (i.e. the CI system) without random failures. What's not to love?
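
In outline the injected approach looked something like this (the names here are made up, not our actual code): the class only talks to a timer interface it was handed, so a test can pass in a mock that it fires by hand.

/* Hypothetical injected timer interface. */
typedef void (*timer_callback)(void *ctx);

struct timer_if {
    void (*start)(struct timer_if *self, unsigned ms, timer_callback cb, void *ctx);
    void (*cancel)(struct timer_if *self);
};

/* Production code stores the interface it was given and uses only that. */
struct session {
    struct timer_if *timer;
};

void session_init(struct session *s, struct timer_if *timer)
{
    s->timer = timer;   /* a test passes in a mock here */
}

The mock records the callback and interval it was given, and the test triggers the callback directly. Notice how much the test has to know about when and how the class starts its timers - that detail matters later.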

For several years we continued that way. It was great: our tests were fast and deterministic, despite testing time, which is notorious for leading to slow, fragile tests. Somehow it always felt a little icky to me though. It took me several years to figure out why, and even then only while reading someone else's code review, out of the context of what his program did. Timers don't feel like a dependency; they feel like an implementation detail. As such, injecting timers leads to the wrong architecture.

Here is my surprising conclusion: timers are shared global state and should be used that way!

That still leaves open the real problem of how to test timers. I solved it by embracing the global state. In C the linker will take the first definition of a function name that it finds. We (my pair and I) took advantage of that fact to write a fake timer with all the timer functions, but a different implementation. You set a timer callback and it goes into a list, but no real timer is created. Then we wrote an advance-time function that goes through the list and triggers, in order, all the timers that have expired. Last, we told the linker to use our fake timer instead of the real thing in our tests.
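
Here is a trimmed-down sketch of the idea. The real timer API on our system isn't shown anywhere in this post, so the function names below (timer_start, timer_cancel, timer_is_running) are illustrative only; the substitution works because the test binary links this file in place of the real timer module.

/* fake_timer.c - linked into tests instead of the real timer module. */

typedef void (*timer_callback)(void *ctx);

#define MAX_TIMERS 32

struct fake_timer {
    unsigned long expires_at_ms;   /* absolute simulated time */
    unsigned long interval_ms;
    int repeating;
    int running;
    timer_callback cb;
    void *ctx;
};

static struct fake_timer timers[MAX_TIMERS];   /* shared global state, on purpose */
static unsigned long now_ms;                   /* the simulated clock */

/* Same signatures the production code already calls - fake implementations. */
int timer_start(int id, unsigned long interval_ms, int repeating,
                timer_callback cb, void *ctx)
{
    timers[id].expires_at_ms = now_ms + interval_ms;
    timers[id].interval_ms = interval_ms;
    timers[id].repeating = repeating;
    timers[id].running = 1;
    timers[id].cb = cb;
    timers[id].ctx = ctx;
    return 0;
}

void timer_cancel(int id) { timers[id].running = 0; }

int timer_is_running(int id) { return timers[id].running; }

/* Test-only helper: advance simulated time, firing expired timers in order. */
void AdvanceTime(unsigned long ms)
{
    unsigned long end = now_ms + ms;
    for (;;) {
        int next = -1;
        for (int i = 0; i < MAX_TIMERS; ++i) {
            if (timers[i].running && timers[i].expires_at_ms <= end &&
                (next < 0 || timers[i].expires_at_ms < timers[next].expires_at_ms))
                next = i;
        }
        if (next < 0)
            break;
        now_ms = timers[next].expires_at_ms;      /* jump to that moment */
        if (timers[next].repeating)
            timers[next].expires_at_ms += timers[next].interval_ms;
        else
            timers[next].running = 0;
        timers[next].cb(timers[next].ctx);        /* may start or cancel timers */
    }
    now_ms = end;
}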

Testing with the fake timer feels more natural than with injected test doubles: when using a double you tell each timer to fire, which means you have to know how the implementation uses its timers. The fake has just one interface: AdvanceTime(). Advance the time 10 seconds and the right things happen. A 1 second repeating timer will fire 10 times, while a 7 second timer fires once - between the 1 second timer firing 7 and 8 times (or maybe 6 and 7 times, depending on ordering). If the 7 second timer starts a 2 second timer, then that new timer will fire at the simulated 9 seconds. The total time for this entire test: a couple of milliseconds!
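
Using the illustrative names from the sketch above (and a TEST/CHECK_EQUAL style of macro like the earlier add example), the 200ms keep-alive rule tests out roughly like this:

/* A toy stand-in for the real keep-alive logic, just for this example. */
static int keepalives_sent;
static void send_keepalive(void *ctx) { (void)ctx; ++keepalives_sent; }

TEST(keepalive_sent_every_200ms_when_idle)
{
    keepalives_sent = 0;
    timer_start(0, 200, /* repeating = */ 1, send_keepalive, 0);
    AdvanceTime(1000);                  /* one second of simulated time */
    CHECK_EQUAL(5, keepalives_sent);    /* fired at 200, 400, 600, 800, 1000 ms */
}

The test never mentions how many timers exist or how they are wired up internally; it just moves time forward and checks what came out.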

One additional bonus: our fake simulates time well. If you have a 200ms timer, advance time 50ms, and then restart the timer, it handles that correctly. If you cancel a timer, it is canceled. Those testing with a mock timer had a lot of work to handle those cases, while it is easy with the fake. When the implementation changed the way a timer was used, a bunch of unrelated mock-based tests broke, while the fake just absorbs that kind of refactoring. Our timer API has an isRunning function - mocks typically did not model that state (and those that did often got it wrong somehow), while the fake handles all the special cases so that when you ask whether your timer is running you get the right answer.

It seems cool, even several months later. Maybe it is? I'm a little scared that as the inventor I'm missing something. I offer it to you though as something to try.