1.4 Risks of Testing

One of the consistencies in life that I have discovered is that if something is of value, there are risks associated with it.

There are risks associated with testing.

Generally, the objective of testing is to decrease the probability of bugs being released in the code delivered to the customer. More generally, we are trying to make things better.

Unfortunately, as with any process or tool, there are certain risks associated with testing. These risks can actually increase the probability of problems appearing in the system. Generally, this is due to a misunderstanding of what the testing processes are attempting to achieve and what they are capable of.

Once understood, it is possible to mitigate the risks associated with Testing.

False Security


In 1981, studies discovered something interesting: removing a doctor's mask during surgery actually reduced incidents of post-surgical infection. Several causes for these results have been postulated: that the addition of the protective equipment had led to the doctors taking larger risks, or that improper use of the protective equipment (due to complacency) had led to an increase in problems. Over the years, similar situations have been identified where the addition of safety equipment has led to an increase in incidents. To pull a couple of examples from Wikipedia:

  • Skydiving
  • Anti-lock Braking Systems

This effect, known as Risk Compensation, is controversial; however, there are some standard elements that can be derived:

  • Increasing the use of safety equipment increases risk tolerance. Effectively, with safety equipment you can get closer to "the edge". This is the positive effect we are looking for.
  • The perceived capabilities of the equipment can exceed the actual capabilities. In other words, people may come to rely on the safety equipment and become complacent in their behaviour. This is the negative effect we need to watch for.

The balance between these two effects is what we are striving for. In the case of seatbelt usage, people tend to behave a little more dangerously, but receive a lot more protection.

In software development, this translates to higher productivity (greater risk taking) with reduced bug reporting (greater protection). The problem becomes how to control the negative side: by putting the safety mechanism in place, some people come to rely on the tool rather than their own good sense. To paraphrase a firearms instructor (Sanjay Malhotra): the safety on a firearm is installed between your ears.

Testing is a quality control tool that acts as a safety mechanism to prevent users from receiving a poor experience. By placing testers between the development team and the end user, developers can experience a sense of dissociation from the users. While this dissociation often increases productivity, it can also decrease the sense of responsibility. Instead of conducting thorough investigations of the components they produce, developers begin to feel that the safety net (the testers) will catch the problems that get through, and therefore worry about them less. Rather than conducting their own investigation, they rely on the tester's investigation to catch their problems for them.

This causes two problems:
  • The developer is the person most familiar with the system and is, therefore, the most likely to be aware of the weaknesses and harmful edge cases; but is no longer looking for them.
  • What was put in place as a redundant secondary system has become the primary system, with no redundancy.

To mitigate this issue, it is important to maintain a sense of responsibility and clear lines of accountability: individuals must remain answerable for the quality of the system or its individual components.

Gaming the System


Goodhart's Law states that when a measure becomes a target, it ceases to be a good measure. One of the most evident applications of Goodhart's Law is in the measurement of Gross Domestic Product (GDP). GDP is a measure of the productive wealth of a nation, but performs that measurement by looking at the side effects of production. When a government sets the objective of increasing the nation's GDP, it will begin to manipulate the side effects (directly within its influence), not the national wealth (not within its influence). The side effects increase, but the wealth remains the same.

The same effect can be observed in the grocery store. Chickens are sold by weight. Customers demand more foodstuff per dollar. Unfortunately, the measure of "foodstuff" is the weight of the food. The objective is therefore transformed from "more foodstuff" to "more weight". Now that the measure has become the objective, lower cost fillers can be used to increase the mass of the food. In the case of chicken, this is often achieved by injecting water into the meat. Since water is not a foodstuff, the measure of weight has become meaningless.

This same problem often becomes evident in testing. As testing is simply a measure of the health of a system, people will tend to focus on meeting "measurable goals". Pass/fail measures of quality make it possible to use the volume of passes as a quantitative measure. As people begin to observe the counts of successes, they will tend to view them as both a measure of their own success and as an objective to be met.

There are fundamental flaws in treating qualitative data in a quantitative manner. In an effort to meet objectives, test developers may make their tests easier to pass. In his article The Joel Test, Joel Spolsky identifies an extreme case of a Microsoft employee writing poor code simply to achieve a goal of "done" (quantity), while not meeting the spirit of "done" (quality).

The story goes that one programmer, who had to write the code to calculate the height of a line of text, simply wrote "return 12;" and waited for the bug report to come in about how his function is not always correct. The schedule was merely a checklist.... In the post-mortem, this was referred to as "infinite defects methodology".

Attention must be paid to not allowing the "score" to matter, or better yet, not allowing there to be a concept of "score" in the first place.

Team members must always be actively encouraged to create harder tests; pushing the system (and themselves) to become progressively better over time. Rather than a scoring system, testing should be viewed like athletes view their measures. When a weight lifter reaches his goal of 200 kilograms, he redefines the goal by adding 20 kilograms; when a runner can run 10km, she increases the objective to 15km; when a developer gets a long loading report down to 5 seconds, he changes the objective to 3 seconds.
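The ratcheting the athletes practise can be applied to a test threshold. As a hypothetical sketch (the function name, margin, and timings are all illustrative), a performance budget can be tightened automatically whenever the system comfortably beats it, rather than left as a static pass/fail score:

```python
# Hypothetical sketch: a performance budget that is ratcheted down each
# time the system comfortably beats it, instead of sitting as a static
# pass/fail "score". All names and numbers are illustrative.

def ratchet(budget_s: float, measured_s: float, margin: float = 0.8) -> float:
    """Tighten the budget when the measurement beats it with room to spare."""
    if measured_s < budget_s * margin:
        return round(measured_s / margin, 2)  # new, harder target
    return budget_s

budget = 5.0                      # "report must load in under 5 seconds"
for measured in (4.8, 3.0, 2.8):  # successive measured load times
    assert measured <= budget, "regression: slower than the current budget"
    budget = ratchet(budget, measured)
print(budget)  # 3.5 — the goal has moved as the system improved
```

The assertion still fails on a regression, but the target itself keeps moving, so there is never a fixed score to game.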

Managers can help foster this environment by never explicitly using the tests as a measure of performance. Rather, they should incentivise efforts to increase the thoroughness of testing.

Personal Failure


Recently, I was asked by a developer to come over to his desk to “look at something”. I dropped what I was doing, walked over to his desk, and stood by while he asked “is this working?”.

I was annoyed.

I had reviewed the work once already and found a minor flaw in the solution. The success/failure criteria were clearly laid out in the ticketing system: either the problem had been fixed, or it hadn't. There was no need to pull me away from other testing that had been in my queue for days to run a test of the system on his computer; all he had to do was compare the results on his screen to what I had already said in writing. Worse, once he submitted the change, I would still be required to perform official, documented, and detailed testing; all of which meant more time to achieve the same results.

The worst part: this wasn't the first time this developer had done this, nor the first time other developers had done it. When asked, developers have always given the same answer: I didn't want it to fail again. I find that statement odd, since whether I'm sitting at my desk or theirs, the test has still failed. However, from the developer's perspective, so long as the failure isn't documented, it is perceived as less of a failure.

I don't fully understand the phenomenon, but whenever you tell people that there is a problem with their work, they assume you mean you have a problem with them; that they are a failure. This is closely intertwined with Gaming the System, and is probably its root cause. We are taught from a very early age that the scores and grades we receive on tests are what is important about us: not the knowledge we have gained, not the process, just the final score. Therefore, we engage in trying to change our score, and we come to view a poor score as a reflection of ourselves. In people's minds, scores and measures are synonymous; identify a failure, and they take it as a personal attack. This must not be allowed to happen.

Success and Failure of tests is about creating dialogue.

A successful run of a test does not mean that the system is behaving well, instead it means that the person who defined the test, and the person that built the system agree as to what the behaviour should be.

It is possible that both are wrong.

Given that both could be wrong in a success case, it stands to reason that either could be wrong in a disagreement case. There are three states:

                 Developer
                 Pass          Fail
  Tester  Pass   Agree-Pass    Disagree
          Fail   Disagree      Agree-Fail

Agree-Pass
Both measurers (the tester and the developer) agree that the expectations have been met.

Disagree
We don't know who is correct, but there needs to be further discussion to determine what the customer's expectations are. This may be cleared up by a quick conversation (oops, missed something), or could result in the entire project having to be redesigned due to a flawed assumption. In either case, the underlying truth has been discovered.

The key is we have found a problem and are dealing with it.

Agree-Fail
This should never occur. If the developer feels the work has not met their standards, it should not leave their desk.
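As a rough sketch (the function, argument, and state names are mine, not a standard convention), the matrix reduces to a simple classification of the two verdicts:

```python
# Minimal sketch of the tester/developer outcome matrix described above.
# The function, argument, and state names are illustrative.

def classify(developer_passes: bool, tester_passes: bool) -> str:
    """Map the two verdicts onto the three states of the matrix."""
    if developer_passes and tester_passes:
        return "Agree-Pass"   # shared understanding of the expected behaviour
    if not developer_passes and not tester_passes:
        return "Agree-Fail"   # should never leave the developer's desk
    return "Disagree"         # a misunderstanding: start a conversation

print(classify(True, True))    # Agree-Pass
print(classify(True, False))   # Disagree
print(classify(False, False))  # Agree-Fail
```

Note that the two Disagree cells are the same state: either party could be the one who is wrong, which is exactly why the next step is a conversation rather than a blame assignment.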

Developers must be aware of this ingrained response we have as humans, and fight it. Failing a test only means that there is a difference of expectations, a misunderstanding. Testing exists to find those misunderstandings.