If you need to identify problems before they become huge issues, then you’ll need to have a good baseline. In this video, you’ll learn about different baseline types and how bottlenecks can be identified when compared with historical information.
Baseline is a pretty broad term, but in technology it usually revolves around a set of metrics that are important to us. If we are application developers, maybe we’re most concerned about application response time. If we’re on the network team, we might be interested in utilization information, or in understanding how many people are using a particular network resource at any given time.
Baselines are very useful as a point of reference. We can accumulate this data over a long period of time and then examine it, and you can often find information hidden in that data by looking at it over a long trend. We can also take what’s happened in the past and use it to help predict what’s going to happen in the future. If we see slow growth over a network link, we may decide that we’ll need to upgrade after a certain amount of time.
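The forecasting idea above can be sketched with a simple least-squares trend line. This is a minimal illustration, not a real capacity-planning tool: the monthly utilization figures, the 80% planning threshold, and the assumption of linear growth are all invented for the example.

```python
# A minimal sketch of trend-based capacity forecasting, fitting a
# straight line to made-up monthly link-utilization samples.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

# Average utilization (percent) sampled once a month for six months.
months = [0, 1, 2, 3, 4, 5]
utilization = [42.0, 45.5, 48.0, 51.5, 54.0, 57.5]

slope, intercept = linear_fit(months, utilization)

# Project forward to estimate when we cross an 80% planning threshold.
threshold = 80.0
months_until_upgrade = (threshold - intercept) / slope

print(f"Growth: {slope:.1f}% per month")
print(f"Upgrade needed in roughly {months_until_upgrade:.0f} months")
```

With slow, steady growth like this, the projection tells you roughly when to budget for the upgrade, which is exactly the “buy at the right time” planning the baseline enables.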
And this allows us to start adding some dollar figures to these baselines. We can start planning when to purchase new products, or when to upgrade network segments, based on what we’ve seen in the past. And we can buy those at exactly the right time. Instead of wasting money in overengineering the network, we can create a network that’s perfectly sized for our purposes.
I run baselines on the Professor Messer websites, on my load balancers, and on my database servers, to get an understanding of what’s happening. And I grabbed one of these baselines just to give you a feel for some of the things I look at. This is a baseline showing how many people are accessing my Apache web server, and I’ve broken it out by day, by week, by month, and by year. The daily view shows me what’s happening at different times of the day.
But look at what happens when I start looking at this over a week-long period. You can almost see the Monday through Friday, as people are accessing the website. And then on the weekends, it dies down a little bit. But you can see on some weeks, it’s a little bit higher. And some weeks, it’s a little bit lower. But we can start to see some trends here. Especially if we look by month, you can very easily see when certain trends are occurring, and when things are back to normal. And this might help me understand more about when I need to increase the amount of computing resources. Do I need to spin up an extra web server so that I can handle this increased load at certain times of the year? And by looking at these baselines, I can really get an understanding of how important that might be.
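The weekday-versus-weekend pattern described above comes from rolling raw samples up by day of the week. Here’s a small sketch of that roll-up; the hit counts are invented to mimic a weekday-heavy site, not real traffic data.

```python
# A sketch of a day-of-week roll-up behind a weekly baseline view.
from collections import defaultdict
from statistics import mean

# (weekday, hits) samples over three weeks; 0 = Monday ... 6 = Sunday.
samples = [
    (0, 980), (1, 1010), (2, 995), (3, 970), (4, 940), (5, 420), (6, 390),
    (0, 1020), (1, 990), (2, 1005), (3, 985), (4, 955), (5, 450), (6, 410),
    (0, 1000), (1, 1030), (2, 980), (3, 990), (4, 960), (5, 430), (6, 400),
]

by_day = defaultdict(list)
for day, hits in samples:
    by_day[day].append(hits)

# Average hits for each day of the week across all sampled weeks.
averages = {day: mean(hits) for day, hits in sorted(by_day.items())}

weekday_avg = mean(averages[d] for d in range(5))
weekend_avg = mean(averages[d] for d in (5, 6))
print(f"Weekdays: {weekday_avg:.0f} hits, weekends: {weekend_avg:.0f} hits")
```

Once traffic is summarized this way, the Monday-through-Friday peaks and weekend dips stand out numerically, not just visually on a graph.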
If we’re going to start making changes to the network or re-engineering parts of our design, then we need to understand where the bottlenecks are. And the bottlenecks are going to be associated with many different things. If we look at a network, we have a connection from the internet that’s going to run at a particular speed. So there’s a bottleneck associated with that. We also have, inside of our networks, switches and routers. Those devices can only send traffic at a certain speed. Firewalls and other devices can also have limitations on how much traffic can really go through those devices.
And it’s not just one metric, either. We’re concerned, of course, about bandwidth. We want to know information about how many flows per second can travel through a particular device. We want to know how many sessions per second that device can create or tear down at any particular time. So we have to look at a lot of different metrics to really understand the overall impact of these bottlenecks.
We can even look at bottlenecks at a device level. Perhaps inside of a server, we’re concerned about things like I/O bus, the input output bus, and understanding just how much data can be transferred from one component to another inside of that server. We want to understand CPU speed, especially if our applications are very CPU-intensive. How much access speed we can get from our storage devices. How fast the network is going to flow. So we could start examining all of the different components inside of that device to really understand the impact that it’s going to have on the overall performance.
We only need to find the one device that’s not performing well to see everything else suffer overall. If you have a storage device, for instance, that is not performing well, then the I/O bus sits idle, the CPU has nothing to calculate, and no traffic goes out of the network interface. So by increasing the storage input and output speed, you’ll increase the overall performance of the application.
So how do you find that weak link? Well, the important part is to monitor as many different components as you can. You want to look at storage speed. You want to understand CPU, network, and memory utilization, and get an idea of just how much of each is occurring at any particular time. This can be more difficult than you might think. Some of these components don’t have an easy way to gather metrics from them. Other components are easily monitored through the operating system. Sometimes you need a third-party device connected to the network to get an overall understanding. Perhaps you query some devices with SNMP, and others are queried using NetFlow statistics. You need to be able to bring all of this data together to really understand what’s going on. And since the data sources can be very diverse, that can sometimes be a bit of a challenge.
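One common way to handle those diverse sources is to normalize every collection method into a single record format before analysis. This is a hypothetical sketch: the field names, device names, and counter values are invented, and real SNMP or NetFlow collection would involve actual polling libraries and exporters.

```python
# A sketch of normalizing metrics from different collection methods
# into one common record shape so they can feed the same baseline.

def from_snmp(device, oid_values):
    """Map raw SNMP-style interface counters into a common record."""
    return {
        "device": device,
        "source": "snmp",
        "in_octets": oid_values.get("ifInOctets", 0),
        "out_octets": oid_values.get("ifOutOctets", 0),
    }

def from_netflow(record):
    """Map a NetFlow-style flow summary into the same shape."""
    return {
        "device": record["exporter"],
        "source": "netflow",
        "in_octets": record["bytes_in"],
        "out_octets": record["bytes_out"],
    }

unified = [
    from_snmp("core-switch", {"ifInOctets": 5_000_000, "ifOutOctets": 4_200_000}),
    from_netflow({"exporter": "edge-router", "bytes_in": 900_000, "bytes_out": 1_100_000}),
]

# Once normalized, every record can be aggregated the same way.
total_in = sum(r["in_octets"] for r in unified)
print(f"Total inbound octets across sources: {total_in}")
```

The payoff is that the baseline and alerting logic only ever sees one schema, no matter how many collection mechanisms sit behind it.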
Here’s a graph that describes the resolution of a bottleneck on my network. I had a baseline going back a number of months, so I could understand the differences between past performance and what was happening that day. Looking at the graph, I saw that PHP response time was relatively normal. Caching and external web traffic were relatively normal. But a very large orange area labeled as database was much higher than it had ever been in any of my baselines. Normally, the amount of time it takes to access my database is very, very low.
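The “much higher than any baseline” judgment can be made quantitative by comparing today’s value against the historical mean and standard deviation. The response times below are invented to mirror this scenario; the three-sigma threshold is a common rule of thumb, not something from the original graph.

```python
# A minimal sketch of flagging a metric that deviates from its
# historical baseline, using mean and standard deviation.
from statistics import mean, stdev

# Months of baseline database response-time samples (ms), all low.
baseline_ms = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0, 4.2, 3.9]
today_ms = 48.0  # the suspiciously large current value

mu = mean(baseline_ms)
sigma = stdev(baseline_ms)
z_score = (today_ms - mu) / sigma

# Flag anything more than three standard deviations from the baseline.
is_anomaly = z_score > 3

print(f"baseline {mu:.1f} ms, today {today_ms} ms, z = {z_score:.1f}")
```

Without months of stored baseline samples, there is nothing to compute the mean and deviation from, which is exactly why the historical data matters.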
And in my environment, these services are very distributed. I have a web server on one device. And my database server is on a completely separate device. So my concerns around this very large response time were really based on the way the traffic flows. I know that people hit a web server. The web server queries the database server. The database server responds back to the web server. And the web server finally responds to the client. Well, during that entire process, we’re dealing with multiple devices. Each one of those devices has memory and CPU and storage. There’s network connections between all of these devices. So immediately, when I saw this very high response time, I started narrowing down where the particular problem might be.
And I found that the bottleneck was really occurring on the network side. Even though this really speaks to a database query, we know that there is communication across the network to make that happen. And my network statistics are generally the first place I go; if your network is not performing well, then nothing is going to perform well. In those statistics, I saw a number of errors being reported on my network interface itself.
Well, that was an easy fix. I called my provider and told them I was seeing physical errors on my network link. They made some changes on their back end with the network connectivity, and you can see I immediately got a performance increase, and everything went back to normal. This is a good example of how important it is to have that historical baseline. I was able to compare what happened historically with what was happening right then, narrow down where the differences were, and then work toward resolving that particular bottleneck.