Troubleshooting a network problem can be challenging without the right process. In this video, you’ll learn a step-by-step method for troubleshooting all types of network issues.
When you’re troubleshooting complex network problems, you may find that the resolution is not as obvious as you might hope. In this video, we’re going to step through a methodology that should help you troubleshoot any problem you run into. This is the flowchart of that network troubleshooting methodology, and we’re going to step through each section of this flow and describe how it can help you solve those really difficult problems.
The first thing you want to do is identify the problem. This may not be as straightforward as you might think. We first need to collect as much information as possible about the issue that’s occurring. In the best possible scenario, you’ll be able to duplicate this problem on demand. This will help later as we go through a number of testing phases to make sure that we are able to resolve this issue.
When a problem happens on the network, it usually affects more than one device, and sometimes it affects those devices in different ways. You want to be sure to document all of the symptoms that may be occurring. Even if they are very different between different devices, you may find that a single problem is causing all of these different symptoms across these different devices.
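As a rough sketch of how documenting symptoms can point toward a single cause, the Python below groups symptom reports by a shared upstream component. The report fields (`device`, `symptom`, `switch`) and all of the sample data are hypothetical and are used only for illustration.

```python
from collections import defaultdict

# Hypothetical symptom reports; the field names and values are made up.
reports = [
    {"device": "laptop-12", "symptom": "web pages time out", "switch": "sw-2"},
    {"device": "phone-07", "symptom": "calls drop", "switch": "sw-2"},
    {"device": "printer-3", "symptom": "jobs stuck in queue", "switch": "sw-2"},
    {"device": "desktop-4", "symptom": "slow file copy", "switch": "sw-5"},
]


def group_by_component(reports, key):
    """Group (device, symptom) pairs by a shared attribute such as the upstream switch."""
    groups = defaultdict(list)
    for report in reports:
        groups[report[key]].append((report["device"], report["symptom"]))
    return dict(groups)


def most_common_component(reports, key):
    """Return the attribute value shared by the largest number of reports."""
    groups = group_by_component(reports, key)
    return max(groups, key=lambda name: len(groups[name]))


# Very different symptoms, but most of them sit behind the same switch.
print(most_common_component(reports, "switch"))
```

In this made-up data, three unrelated-looking symptoms all trace back to one switch, which would be a reasonable first place to look.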
Many times, these issues will be identified by the end users, so they may be able to provide you with a lot more detail of what’s really happening. You should question your users to find out what they’re seeing and if any error messages are appearing. In this course, we’ve already discussed the importance of the change control process and knowing exactly what is changing in your environment.
Without some type of formal change control process, someone may be able to make an unscheduled change that would affect many different people. So when an error or network problem occurs, you may want to find out what was the last thing that changed on this network that could have affected all of these users. There’s also going to be times when you’re examining a number of different problems that may not actually be related to each other. It’s always best to separate all of these different issues out so that you can approach and try to resolve each issue individually.
Now that you’ve collected as much information as possible, you can examine all of these details to begin establishing a theory of what you think might be going wrong. Since the simpler explanation is often the most likely reason for the issue, that may be a good place to start. But of course, you’ll want to consider every possible thing that might be causing this issue, including things that aren’t completely obvious.
You could start from the top of the OSI model with the way the application is working and work your way to the bottom. Or you may want to start with the bottom with the cabling and wiring in your infrastructure and work your way up from there. You’ll want to list out every possible cause for this problem. Your list might start with the easy theories at the top, but of course include all of the more complex theories in this list as well.
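One way to picture the top-down and bottom-up approaches is to tag each theory with an OSI layer and sort the list in either direction. This is only a sketch: the theories listed here are invented examples, and `order_theories` is a hypothetical helper name.

```python
# OSI layer numbers, from 1 (physical) up to 7 (application).
OSI_LAYERS = {
    "physical": 1,
    "data link": 2,
    "network": 3,
    "transport": 4,
    "session": 5,
    "presentation": 6,
    "application": 7,
}

# Hypothetical theories, each tagged with the layer it belongs to.
theories = [
    ("DNS server returning stale records", "application"),
    ("Damaged patch cable", "physical"),
    ("Wrong default gateway", "network"),
    ("Switch port in the wrong VLAN", "data link"),
]


def order_theories(theories, top_down=True):
    """Sort theories from layer 7 down to layer 1, or from layer 1 up if top_down is False."""
    return sorted(theories, key=lambda t: OSI_LAYERS[t[1]], reverse=top_down)


for description, layer in order_theories(theories, top_down=False):
    print(f"Layer {OSI_LAYERS[layer]} ({layer}): {description}")
```

Calling `order_theories(theories)` with the default `top_down=True` would put the application-layer theory first instead.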
Now that we have a list of theories about what might be causing this issue, we can test those theories. We may want to go into a lab. And if we are able to recreate this problem in the lab, then we can apply each theory until we find the one that happens to resolve the issue. If the first theory doesn’t resolve the issue, you may want to reset everything and try the second theory or the third. And if you run out of theories, you may want to go back and think of other things that might be causing this problem.
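The test, reset, and try-again cycle described above can be sketched as a simple loop. Everything here is an assumption made for illustration: the `lab` dictionary stands in for a real lab environment, and `try_theories`, `reset_lab`, and `problem_resolved` are hypothetical names.

```python
# A toy stand-in for a lab that reproduces the problem.
BROKEN_STATE = {"cable": "damaged", "gateway": "10.0.0.1"}
lab = dict(BROKEN_STATE)


def reset_lab():
    """Put the lab back into the state that reproduces the problem."""
    lab.clear()
    lab.update(BROKEN_STATE)


def problem_resolved():
    """In this toy lab, the problem goes away only when the cable is good."""
    return lab["cable"] == "good"


def try_theories(theories, reset, resolved):
    """Apply each theory's fix in turn, resetting the lab before each attempt.

    Returns the name of the first theory that resolves the problem,
    or None if the list runs out and more theories are needed.
    """
    for name, apply_fix in theories:
        reset()
        apply_fix()
        if resolved():
            return name
    return None


theories = [
    ("wrong default gateway", lambda: lab.update(gateway="10.0.0.254")),
    ("damaged cable", lambda: lab.update(cable="good")),
]

print(try_theories(theories, reset_lab, problem_resolved))
```

Resetting before every attempt keeps a change from an earlier theory from masking the result of a later one. A `None` result corresponds to running out of theories and going back to come up with more.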
This might be a good time to bring in an expert who knows about the application or the infrastructure, and they can give some theories and possible resolutions to test in the lab. Once you’ve tested a theory and found that the theory is going to resolve this issue, you can then begin putting together a plan of action. This is how you would implement this fix into a production network.
You want to be sure that you’re able to do this with a minimum amount of impact to the production network and its traffic, so often you’ll have to do this after hours when nobody else is working on the network.
A best practice is to document the exact steps that will be required to solve this particular problem. If it’s replacing a cable, then the process will be relatively straightforward. But if you’re upgrading software in a switch, a router, or a firewall, there may be additional tasks involved in performing this plan of action. You’ll also want some alternatives if your plan doesn’t go as designed. For example, you may run into problems when upgrading the software in a firewall, so you may need an additional firewall or a way to roll back to the previous version.
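A documented plan with alternatives can be thought of as a list of steps, each paired with a way to undo it. The sketch below is a minimal illustration under stated assumptions: the `firewall` dictionary, the step functions, and `run_plan` are all hypothetical, and the simulated upgrade always fails so the rollback path is visible.

```python
# Toy state standing in for a primary firewall and a spare.
firewall = {"active": "primary", "version": "1.0"}


def switch_to_spare():
    firewall["active"] = "spare"


def switch_to_primary():
    firewall["active"] = "primary"


def upgrade_software():
    # Simulated failure for the purposes of this example.
    raise RuntimeError("image failed verification")


def restore_previous_version():
    firewall["version"] = "1.0"


def run_plan(steps):
    """Run (name, action, rollback) steps in order.

    If an action raises, undo the steps that already completed,
    in reverse order, and report failure along with a log.
    """
    completed = []
    log = []
    for name, action, rollback in steps:
        try:
            action()
        except Exception as exc:
            log.append(f"FAILED: {name} ({exc})")
            for done_name, done_rollback in reversed(completed):
                done_rollback()
                log.append(f"rolled back: {done_name}")
            return False, log
        completed.append((name, rollback))
        log.append(f"done: {name}")
    return True, log


steps = [
    ("switch traffic to spare firewall", switch_to_spare, switch_to_primary),
    ("upgrade primary firewall software", upgrade_software, restore_previous_version),
]

ok, log = run_plan(steps)
for line in log:
    print(line)
```

Writing the rollback next to each step at planning time means the alternative is already decided before the change window starts.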
Now that you’ve documented your plan of action, you can take that to your change control team, and they can give you a window when you can implement that change. The actual fix will probably take place during off hours, outside of production times, and you may need to bring in other people to assist, especially if your window is very small.
Once you have executed your plan of action, your job isn’t done yet. We need to make sure that all of these changes actually resolved the problem. So now that the changes have been implemented, we need to perform some tests. We may want to bring in the end users who first experienced this problem so that they can run through exactly the same scenario to tell you if the problem is resolved or if the problem still exists. This might also be a good time to implement some preventive measures. That way, we can either be informed that the problem is occurring, or we can provide alternatives that we can implement if that problem happens again.
After the problem has been resolved, this is a perfect time to document the entire process from the very beginning to the very end. You’ll of course want to provide as much information as possible, so if somebody runs into this issue again, they can simply search your knowledge base, find that particular error that popped up, and know exactly the process you used to solve this last time.
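To make the idea of searching for a past error concrete, here is a small sketch of a knowledge base lookup. The entries, their field names (`title`, `error`, `resolution`), and the `search` function are hypothetical. A real help desk system or wiki would have its own search feature.

```python
# Hypothetical knowledge base entries written up after earlier incidents.
knowledge_base = [
    {
        "title": "Intermittent packet loss on access switch",
        "error": "Late collisions logged on the uplink port",
        "resolution": "Set both ends of the link to the same speed and duplex.",
    },
    {
        "title": "Clients cannot obtain an address on the new VLAN",
        "error": "DHCP request timed out",
        "resolution": "Added a DHCP relay on the new VLAN interface.",
    },
]


def search(entries, query):
    """Return entries whose title or error text contains the query, ignoring case."""
    q = query.lower()
    return [
        entry
        for entry in entries
        if q in entry["title"].lower() or q in entry["error"].lower()
    ]


for entry in search(knowledge_base, "dhcp"):
    print(entry["title"], "->", entry["resolution"])
```

The lookup only works if the original write-up recorded the exact error text, which is one reason to capture error messages word for word.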
Many organizations have a help desk with case notes that they can reference, or you might have a separate knowledge base or wiki that you create where you’re storing all of this important information for the future. A document that was created a number of years ago but still shows the importance of keeping this documentation over time is from Google Research, where they documented the failure trends in a large disk drive population.
And because they were keeping extensive data over a long period of time, they were able to tell when a drive was starting to fail based on the types of errors that they were receiving. Being able to store all of this important information, being able to go back in time to see what happened, becomes a very important part of maintaining a network for the future.
Let’s summarize this troubleshooting methodology. We start with gathering as much information as possible, asking users about what they’re seeing, and documenting any specific error messages. Then we want to be able to create a number of theories that might solve this particular problem. And once we have this list, we want to be able to put it in the lab and try testing each one of these theories until we find the one that actually resolves the issue.
From there, we can create a plan of action and document any possible problems that might occur. We can then get a time to implement the fix and put it into our production environment. And then we can verify and test and make sure that the entire system is now working as expected. And of course, finally, we want to document everything that we did from the very beginning of our troubleshooting process all the way through to the end.