The Well Tested Programmer

Thursday, November 20, 2014

Don't test implementation details.

Those who have the misfortune to have me on their code reviews see me repeat this often. Don't test implementation details.

What do I mean? In short, don't test something you are likely to refactor latter. Find a test point that isn't likely to change. All too often I see someone write a seemingly useful test, but it locks them into a design that shouldn't be locked in.

Mocks are the primary way I see this rule violated. Your code calls a function with some arguments, so you have your mock check for those. However there are often several different ways to use the function that in the end give the same result. Your tests should support changing implementations to see what is better.

For example, the standard file read function has a length argument, if you mock this, your mock needs to be powerful enough to handle calling it once with a buffer size of 1000 bytes, or twice with a 500 byte buffer. Both are equally valid, and depending one how you use the data and performance considerations one night be better. As hardware changes which is better may change. With good tests you can try many different read sizes to see if any make a difference in the real world.

Read the above again, but not with this in mind: Computers have been doing file IO for many years. We understand it well. It is very unlikely someone will come up with a new way to do file IO that is completely different from what your program supports today, and compellingly enough better that you would change your program to use it.

Contrast to your typical program. The internal API is very much in flux because nobody knows what the right way to solve the problem is. It seems to be taken as a matter of faith that you should break your tests up, but where do you break things up? Once you break something up you won't touch it that API again. Not that you can't touch it, just that anyone who does will be afraid of all the tests that break and give up instead.

One role of architecture is figure this out, but they only get a part of the work. Anytime architecture gives you a well defined interface, that is a great place to jump in with test doubles: at least you know that it is intended to be hard to change that API. Thus your tests are a little less fragile: you won't be making changes anyway. However often we want more granular isolation, and then test doubles stand in the way of change.

Figuring out the answer to this question is in fact one of the reasons I started this blog. I don't know what is right, but I'm starting to recognize a problem. Expect to see this topic explored in greater detail over time.

Note, don't take my comments on file IO as a statement that I believe you should mock file IO. There is actually a lot of debate on this. There are some people who say you never mock anything that you don't own both sides of. They will tell you never mock file IO.

Tuesday, November 11, 2014

Testing timers

In the testing world very little is said about how to test timers. The little advice given seems to be you isolate it, and code review those isolated areas instead of testing. That advice is okay for a lot of applications: waiting is generally a bad thing. Nobody finishes preparing a web page and then puts in a delay before sending it. As such timers are generally only in a couple (not very critical) places, and not a core part of any business logic.

What if you don't have that luxury, what if time is what your business logic is about. My team is on an embedded system. A large part of our business logic related to timers. For example, sending keep alive messages every 200ms, unless some other message has been sent. We have to deal with a number of different hardware devices with different rules, so getting all the logic correct is a significant source of potential bugs. To make it harder, we need to get certification from a third party on each release, mess up one of the rules and we can't sell anything. To just rely on code review isn't enough: we need tests that use timers.

A few years ago one of my teammates injected a timer into his class. It seemed like a good idea, now that it is injected he can inject a mock and test his time using class without waiting for real time to pass. Since his requirement was a 20 second timeout, and there were several border cases to test, this brought his test time down from over a minute to a few milliseconds. As a bonus, his tests could run on heavily loaded machine (ie the CI system) without random failures. What is not to love?

For several years we continued that way. It was great, our tests were fast, and deterministic despite testing time which is notorious for leading to slow, fragile tests. Somehow it always felt a little icky to me though. It took me several years before I figured out why, and then when reading someone else's code review out of context of what his program did. Timers doesn't feel like a dependency, they feels like an implementation detail. As such injecting timers leads to the wrong architecture.

Here is my surprising conclusion: timers are shared global state and should be used that way!

That still leaves the problem of how to test timers open, and those are real enough problems. I solved it by embracing global state. In C the linker will take the first function of a name it finds. We (my pair and I) took advantage of that fact to write a fake timer, with all the timer functions, but a different implementation. You set a timer callback and it goes to a list, but no real timer is created. Then we wrote and an advance time function that went through the list and triggered all the timers that expired in order. Last, we told the linker to use our new timer not the real thing in our tests.

Testing with the fake timer feels more natural than with injected test doubles: when using a double you tell each timer to fire, but you have to actually know the how the implementation uses the timer. The fake though just has one interface: AdvanceTime(). Advance the time 10 seconds, and the right things happens. A 1 second timer will fire 10 times, while a 7 second timer fires once - between the 1 second timer firing 7 and 8 times (or maybe 6 and 7 times - depending on ordering). If the 7 second timer starts a 2 second timer, than that new timer will fire at the simulated 9 seconds. The total time for this entire test: a couple milliseconds!

One additional bonus, our timer simulates time well. If you have a 200ms timer, advance time 50ms, and then restart the timer it will handle that correctly. If you cancel a timer it is canceled. Those testing with a mock timer had a lot of work to handle those cases, while it is easy with the fake. When the implementation decides to change the way the timer is used a bunch of unrelated tests broke, while the fake just handles that refactoring. Our timer supports an isRunning function - mocks typically did not model this state (and those that did often got it wrong somehow), while the fake handles all the special cases to ensure that when you ask is your timer running you get the right result.

It seems cool, even several months later. Maybe it is? I'm a little scared that as the inventor I'm missing something. I offer it to you though as something to try.

Tuesday, October 28, 2014

Why do we automate our tests?

Sounds like a simple question, but I see many people who don't actually understand why we automate our tests.

Before TDD I used to write a lot of one-off programs to test my code. While they were not exhaustive,they were pretty good - my code had few defects. Even with TDD my code is not perfect. I'm not sure what part of my better code today is because of TDD and what is because as an older programmer I have learned to write better code. So does automation gain anything?

Some people have claimed that 90% of automated tests never fail. I don't know if this number is correct, but my personal experience suggests it isn't too far off. So why not throw them away and forget about automation?

Some people claim that automation is done so you can measure coverage of your tests. Someday I will write about this in more detail. For now, if test coverage numbers are your goal, you can get those numbers without automation. So again automation hasn't proven itself. (I will admit here that getting coverage without automation probably requires a lot more effort than automation, but the point is you don't need automation).

Why automate? I'll go back to the earlier claim that 90% of tests never fail. That leaves 10% of tests that at some point in the future have your back. That is someday in the future you, will make a change thinking all is well, not realizing some obscure side effect that broke some other seemingly unrelated feature. Fortunately some test will save you by informing you of that side effect.

If only we knew which 10%, we could throw the rest away and get all the value. However we don't know which tests those are. Instead we keep them all, maintain them all,and wait for them to run each build. All for the time we weren't perfect.

Those 10% are the real value of automated tests. My one-off programs worked great, but they were moments after they passed. The next day if I had to make a change I had to re-create them. Generally I created the ones I thought I needed, which was not the complete set, and I had no way to being sure that everything I skipped actually didn't matter. Thus the value of automated tests: all your tests are there and working for you when you need them.

So in conclusion, the reason to automate tests is your next code change, and has nothing to do with the current days effort. If your code has no future than automated tests are not useful.

Wednesday, October 22, 2014

Now that we have defined the test doubles. what next?

I previously defined the four different test doubles, but definitions are often not much help in figuring out what to use in your current situation. So let me look at the differences in more detail.

The first thing to notice is stubs and fakes are about what a particular function call returns, while mocks and spys are about what a function was called with. This fundamental difference is a minor point though in real use, how can you stub/fake know what to return without the function arguments it is getting anyway? Likewise your mocks need to return something when they are called. (I have never seen a spy used where it has to return something, but I suppose someone might come up with one)

There is a continuum between stub and fake, I don't know any hard rules to say where the boundary is, but in practice you generally know. A stub generally can only return a couple different canned values from a function, and that is it. A fake is more complex: it tracks state and previously set values to create the proper return values, while still taking some shortcuts.

A fake is generally complex enough that you have tests just for the fake. Either because you know the behavior you want to model and directly test drive it, or after the fact adding tests because changes to the fake keep causing hundreds of tests to fail in non-obvious ways. In the latter case the test are added out of defense programing, you made the mistake so you write tests to help the next person not make it.

A fake is also something you could ship to some customers - as a demo of the full product. Since a fake is a shortcut to something hard to get access to, the fake makes a great training simulator. I even know of one case where a fake of the database is what was shipped when they realized the fake actually did everything they needed without the overhead of a SQL database.

If you are still confused, don't worry. The difference between stub and fake isn't important. It wouldn't be hard to make an argument that there is no difference. I like a separation myself, but I'm not sure if I'm being silly.

Mock frameworks have a interesting side effect, once you have gone through the effort of creating the mock once, you can quickly create many stubs just by changing the return value of only the functions that are called. This is often the way mock frameworks are used.

Spys are most often used in an observer pattern, since there is no return value to worry about. This is too bad because mock breaks the arrange act assert pattern of good tests as you have to assert before you act.

Monday, October 20, 2014

Defining the Test doubles

There is a lot of confusion about test doubles, many programmers are not even aware of all the options. This is a shame because each option has value.

Stub is a implementation that returns canned values. This implies that some of the results can make no sense in the real world, but for the purposes of the test they are good enough.

Fake is a full implementation, but it doesn't do everything the real world implementation needs. It might use an in memory database, it might pretend to be a serial port with a device attached, but not actually use a real serial port.

Mock is a way to ensure functions are called in specific ways. Call a function with the wrong argument and the test fails.

Spy is a way to check what functions were called after the fact.

I hope that the above are obvious, but still it is good to have everything defined so we can have a conversation about test doubles.

Of course the next step is defining when to use each. That is a large topic that I hope to explore over time.

Tuesday, October 7, 2014

Should tests drive code?

On a code review recently someone asked a question about one test: what change did this test drive. The implication being that the test in question didn't do anything to drive the change under review. Is this even a valid question to ask?

I do not know what motivated the question, but if I can speculate. Test Driven Development is, for good reasons, all the rage these days. TDD is all about writing tests to drive code, which is a great way to get started.

What few people realize TDD misses is "works by design". There are many points in TDD cycle where the simplest thing that can possibly work has a side effect also causes some other useful behavior that users will rely on. These side effects need to be tested, even though the test will go green instantly. Otherwise some future person may refactor and lose that useful side effect without any tests breaking.

Sounds easy when I put it that way. In the real world it isn't. You may not even realize when you make the decision what the side effect is. Even if you know, there may be a lot of other code required before the test that would require the behavior can be written. In the former case you obviously wouldn't write it, while in the latter it tends to get shoved to the bottom of your every expanding todo list and forgotten.

Thus there is always need when working on code to ask the question "are the existing tests in this area actually sufficient?" If the answer is no, then you need to write the missing tests.

But wait, what if my premiss is wrong. There is also the possibility that the question is rooted in the statement that while the test itself good, it is outside the scope of the current story and review. This is actually a valid point, I'm all for breaking work up. One advantage of great version control systems is when you are in code and see an opportunity where you want more tests. When you need to write a test that isn't related to the current work, just check the current work in, update your source, make the change, and check in, and go back to your main work.