How to Troubleshoot

Being an effective troubleshooter often means thinking like a scientist, yet many troubleshooters in the IT field rarely employ the battle-tested cognitive skills our scientific brethren have learned. A lot of these lessons apply to general problem solving as well, but this post will focus on, and use examples from, IT.

First off, let’s take a moment to refresh ourselves on the basic steps involved in the Scientific Method. Roughly, it goes:

  1. Determine your Purpose or ask a Question
  2. Do some background Research
  3. Come up with a Hypothesis
  4. Design and execute good Experiments
  5. Review and analyze your Data
  6. Communicate your Conclusions

As (T2+) troubleshooters, lucky us, we don’t have to rack our brains or flex our creative muscles to determine our Purpose. Odds are our Purpose just left us a panicked voicemail or sent us an email marked as “High Importance” with “[URGENT]” in the subject.

We also largely get to skip doing background Research because, usually, our Purpose sought us out precisely because we already have enough background knowledge to jump right in.

As troubleshooters, we use a slightly modified variant of the Scientific Method:

  1. Someone reports a Problem
  2. Perform initial Observations
  3. Come up with a Hypothesis
  4. Design and execute good Experiments
  5. Communicate your Conclusions
  6. (T3 support) Develop and implement solution

Initial Observations

When first facing a problem, we have to approach it with the critical eye of an absolute skeptic; trust, but verify. Tales from Tech Support reminds us all why it’s important to do this. Do not forget, however, that our goal is to solve a problem; if we condescend, we may end up just creating more.

On the technical side, we will already have some Expected Behaviour (B_{e}) in mind. So what we have to do is make Observations about the actual behaviour being exhibited (B_{o}). The problem is essentially the difference between the two ({\Delta}B = \lvert B_{e} - B_{o} \rvert). If there were no difference, there wouldn’t be a problem in the first place.

At the end of this phase you should be able to describe “what is wrong” in your own words, based on your own experiences (without taking anything on faith).

Forming a Hypothesis

We will also have a mental model of the system in mind, i.e. some set of “rules”, like Rules = \{ r_{0}, r_{1}, …, r_{n} \}. We might (if we really broke it down) have hundreds, if not thousands, of little “rules” in our mental model of the system. Some example rules that might be in the model:

  • r_{0} = The computer is on.
  • r_{1} = The Internet is connected.
  • r_{114} = The Contacts entity exists.
  • r_{270} = When a new Contact is added, a workflow is triggered that creates a Task.
  • r_{638} = In this C# code, the variable “contactsList” is declared and instantiated.

Don’t actually write this down; just be mindful that this is essentially your list of assumptions about the system (and you know what they say about assuming).

Forming a Hypothesis involves recalling the {\Delta}B and trying to come up with a story that explains that difference based on our mental model. When the observed behaviour clashes with our expectations based on our model, it can be confusing. Note that confusion isn’t a bad thing; in fact, confusion is a Clue.

When you notice that you are confused:

Either Your Model Is False Or This Story Is Wrong.

Your Strength as a Rationalist – Eliezer Yudkowsky

This means that either one of the rules in our model is false (or missing), or what we think we are observing is not truly what is occurring.

Good hypotheses should be specific, testable, and consistent with the observed behaviour.

Experimenting

Coming up with good experiments that will quickly and effectively test our hypothesis is one of the most challenging and rewarding aspects of troubleshooting. It requires a quick wit, an active imagination, the ability to think outside the box, and strong spatial reasoning skills. It is also highly domain-dependent: the tests we can reasonably conduct on a relational database are quite different from the tests we might try on a web page, which are very different from the tests we might try when diagnosing a network or connectivity issue, and all of those are radically different from the tests a doctor might run on a patient or a mechanic might perform on your car. At the end of the day, though, it’s all troubleshooting.

Getting a wide breadth of experience with different technologies and platforms will increase our range as troubleshooters. See this article from Less Wrong about being the “Pareto best” in order to cover more of the potential problem space. The takeaway is that (following the Pareto Principle) we can expect to get 80% of the way to mastery with 20% of the effort required to reach 100%. By spending that 20% effort on 5 fields, instead of 100% on one, we can become significantly more effective troubleshooters.

Going back to our analogy to the Scientific Method, when designing experiments we need to keep three things in mind at all times:

  1. Independent Variables
  2. Dependent Variables
  3. Fixed (or “Controlled”) Variables

Independent variables are what we will be changing in the test, dependent variables are what we will be looking for changes in, and fixed variables are things that should not change at all in the test.

A common trap many troubleshooters fall into is not properly accounting for the fixed variables. That is to say, they do not actually understand what it is they tested. Perhaps they thought they were testing one independent variable, but were unwittingly testing two or three different independent variables because of poor test design. This introduces confounding variables and can lead to incorrect conclusions.

Let’s look at a very simple hypothetical problem, examine two potential tests, and break down what the different variables are.

Situation: A user cannot access our website.

Test #1: Have the user try from a different browser.
Independent Variables: Browser-specific possibilities (user agent, rendering engine, cache, extensions, etc.)
Dependent Variables: Ability to access the website.
Fixed Variables: Network connectivity, webserver configuration, user profile.

Test #2: Have the user try from a different computer.
Independent Variables: Browser-specific possibilities (user agent, rendering engine, cache, extensions, etc.), Network configuration, user profile
Dependent Variables: Ability to access the website.
Fixed Variables: Webserver configuration.
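
If it helps, the same breakdown can be jotted down in a more structured form before running the tests. Here is a rough sketch (the shape and field names are my own, purely illustrative):

var test1 = {
  change:    ['browser'],                          // independent variables
  observe:   ['can the user access the website?'], // dependent variables
  heldFixed: ['network connectivity', 'webserver configuration', 'user profile']
};

var test2 = {
  change:    ['browser', 'network configuration', 'user profile'],
  observe:   ['can the user access the website?'],
  heldFixed: ['webserver configuration']
};

Writing a test down like this, even informally, forces us to be explicit about what we are actually varying and what we are assuming stays put.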

In Test #1, we only had one independent variable, and many fixed variables. This means that if the test succeeds we can be reasonably confident that the single independent variable is the culprit, and we can tackle it more pointedly in subsequent tests and attempted fixes. If the test fails, however, we have only ruled out one potential problem.

In Test #2, we do the inverse: we have many independent variables and only one fixed variable. If the test succeeds, we have only ruled out one potential cause. However, if the test fails, then we may want to look harder at that one fixed variable.

As if we didn’t have enough to juggle already, yet another factor to take into consideration is the client’s time. In this specific example, we are having a user perform the tests for us. So while a test like #1 may methodically rule out potential causes, it can also be the most time-intensive option. Alternatively, a test like #2 quickly lets us rule out our webserver as at-fault. Depending on the situation, this may actually be where you halt the troubleshooting process and refer the user to Tier 1 support alternatives.

Another tool in our belt is intuition. We don’t have to blindly pursue unlikely causes. What we need to do is Focus Our Uncertainty and favour tests that address the most likely causes first. For instance, in our example, the first thing to check is probably not the user’s hosts file to see if they happened to mess up our site’s name resolution.

Putting it all together in an example

Let’s look at a small, specific example. Try not to find the bug right away; instead, let’s go through the above process as a thought experiment.

Suppose we have just developed a proof-of-concept web application (JSFiddle) that we want to test.

When we fire up our web app, everything looks good. We see the list of contacts and each name has an “Add Friend” button next to it. However, much to our dismay, it seems that none of the buttons do what we want!

Clicking “Add Friend” just displays the message “Adding friend #: 3”. After that, it is unclear what the buttons are doing.

B_{e}: Clicking the button Add Friend will display the message: “Adding Friend #: <id>” where <id> is the same number as the value in the “ID” column in the table.

B_{o}: Clicking any Add Friend button displays the message: “Adding Friend #: N” where N is the number of rows in the table. No further button clicks appear to have any effect.

{\Delta}B: Instead of displaying the ID of the row on which the button was clicked, it just states the number of rows in the table.

Clearly, our target hypothesis-space should be focused on the event handler function.
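
Since the fiddle itself isn’t embedded in this post, here is a minimal sketch of roughly what the relevant code might look like. The contacts array, the buttonMsg element, and the handler body are taken from the snippets below; the table markup, the element ids, and the exact loop bounds are my own reconstruction and may well differ from the real thing:

var contacts = ['Alice', 'Bob', 'Carol'];
var table = document.getElementById('contactsTable'); // assumed element id

for (var i = 0; i < contacts.length; i++) {
  var row = table.insertRow();
  row.insertCell().innerText = i;           // the "ID" column (shown here as the loop index)
  row.insertCell().innerText = contacts[i]; // the "Name" column

  var button = document.createElement('button');
  button.innerText = 'Add Friend';
  button.onclick = function() {
    // note that the handler references the loop variable i directly
    document.getElementById('buttonMsg').innerText = 'Adding friend #: ' + i;
  };
  row.insertCell().appendChild(button);
}

With something like this in front of us, the experiments below are easy to run.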

We may try adding or removing contacts from the array to see whether it’s somehow just always “3”, or whether it’s actually “the number of contacts”. (Try this for yourself in the JSFiddle!)

We may also try changing the div-based message to an alert call so we can be confident the event handler is actually firing each time, rather than getting stuck somehow. (Again, try it for yourself; you’ll see that it is, in fact, firing each time.)
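
That second experiment might look something like this inside the loop (a quick, throwaway change, not something to keep):

button.onclick = function() {
  // alert() makes it obvious the handler runs on every click,
  // even though the displayed text never appears to change
  alert('Adding friend #: ' + i);
};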

Your strength as a rationalist is your ability to be more confused by fiction than by reality.

Your Strength as a Rationalist – Eliezer Yudkowsky

At this point, we may be feeling confused, because the button’s event handler is set dynamically at each iteration, and it looks like it should be displaying whatever i was at each step. After all, the rows have the right ID displayed…

button.onclick = function() {
  document.getElementById('buttonMsg').innerText = 'Adding friend #: ' + i;
};

Thanks to our earlier tests, we can confidently narrow the problem down to i. At this point, the “troubleshooting” exercise is complete. We managed to positively identify the source of the problem down to the line number.

If you’re curious about how to actually fix this bug, I suggest you check out my other blog post about Encapsulating Your Scope When Looping With Asynchronous Functions. Here’s the solution (JSFiddle):

In this case, the event handler is essentially “an asynchronous function”: the handlers are all declared synchronously, but the buttons are clicked by the user at some later point in time, by which point i has already increased to the length of the contacts array.

button.onclick = (function(x) {
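  // capture the current value of i in x, so each handler keeps its own copy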
  return function() {
    document.getElementById('buttonMsg').innerText = 'Adding friend #: ' + x;
  };
}(i));
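
As an aside (and not something the linked post relies on), modern JavaScript sidesteps the whole issue: declaring the loop variable with let gives each iteration its own binding, so no wrapper function is needed.

for (let i = 0; i < contacts.length; i++) {
  // ... build the row and button as before ...
  button.onclick = function() {
    document.getElementById('buttonMsg').innerText = 'Adding friend #: ' + i; // each handler sees its own i
  };
}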

Communication

The most important component of troubleshooting is actually communicating your findings to someone else (e.g. the client who contacted us, or someone who is able to develop the fix). Sometimes certain solutions require decisions or client sign-off. In these cases, it doesn’t matter if we know exactly what the problem is and exactly how to fix it if we cannot communicate either idea to anyone else.

Remember that, coming fresh out of a troubleshooting session, we will have far more context than our audience does, so do not Expect Short Inferential Distances.

We always know what we mean by our words, and so we expect others to know it too. Reading our own writing, the intended interpretation falls easily into place, guided by our knowledge of what we really meant.

Illusion of Transparency: Why No One Understands You – Eliezer Yudkowsky

Here are some tips for technical communication that I wish every IT professional would take to heart:

  • Avoid excessive pronouns (e.g. this, that, it, the same, etc.)
  • Create analogies for laypeople
  • Use examples
  • Include screenshots
  • Include links to documentation that helps provide context