If your storage subsystem isn’t happy, then nobody is happy. In this video, you’ll learn some techniques for troubleshooting some challenging hard drive issues.
Hard drives are fast-spinning, mechanical devices that will eventually fail, so we need to understand exactly the process to go through when we run into any of these hard drive problems.
One of these problems that might occur is a read/write failure. You might see a message that says, “cannot read from the source disk.” You might also have a hard drive that is giving very poor performance– maybe its access activity light is constantly flashing, and it’s having to retry constantly to read or to write information to that drive.
Another obvious problem with hard drives is when you have a loud, clicking noise when you try to access that drive. We sometimes call this the click of death because this is certainly indicative of a major mechanical problem with your hard drive.
One of the first things you should do whenever you’re troubleshooting a hard drive problem is to back up your data. Make sure that you always have a backup of this information. That’s a good idea always, but especially when you’re recognizing that a problem may be occurring. Then you might want to check for the easy problems. See if any cables are loose or not connected properly. If there’s a lot of overheating occurring inside of your computer case, this might also cause problems with disks, so you want to be sure that you keep your temperatures as low as possible. Also check your power supply– make sure that your hard drive is receiving enough power. This might also become an issue if you’ve added new components to your computer– you might be overloading the capabilities of your existing power supply. And, ultimately, run some hard drive diagnostics. Your computer manufacturer or hard drive manufacturer will have some programs you can run to test every sector that’s on that drive to see just how operational that drive might be.
If you’re having a problem when you’re booting your computer, it’s helpful to know if the problem is with your hard drive hardware or if the problem is with software. If you’re seeing that the drive is not recognized at all– you’re not getting any access lights on the drive, maybe you’re hearing some beep codes from the BIOS when it starts up, or you’re seeing an error message that says that the drive is not accessible– then you probably have some type of hardware issue. If you’re getting a message that says that the operating system is simply not found, that means that at least the hard drive is there and your system is able to access it, but the operating system files it’s looking for could not be found on that drive.
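That hardware-versus-software split can be written down as a tiny decision function. This is only a sketch in Python: the function name and its two boolean inputs are made up for illustration, and they simply encode the symptoms described above rather than any real diagnostic tool.

```python
def classify_boot_failure(drive_detected: bool, os_found: bool) -> str:
    """Roughly classify a boot problem from two observed symptoms.

    drive_detected: the BIOS recognizes the drive (hypothetical input).
    os_found: the system found an operating system on it (hypothetical input).
    """
    if not drive_detected:
        # Drive not recognized, no access lights, or "drive not accessible"
        return "hardware"
    if not os_found:
        # Drive is visible, but "operating system not found"
        return "software"
    return "no boot failure"


print(classify_boot_failure(drive_detected=False, os_found=False))
print(classify_boot_failure(drive_detected=True, os_found=False))
```

The first call reports a likely hardware issue and the second a likely software issue, which matches the order you would check them in practice: confirm the drive is seen at all before worrying about what is on it.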
To troubleshoot a boot failure, we want to, of course, check and make sure that everything is connected properly, maybe reseat some of those connections to our hard drives, and then check the boot sequence in our BIOS. We want to see if there are any removable disks that might be connected to our computer and might be booting first. You might see that a USB drive is connected and your system is trying to boot from that USB drive instead of the hard drive that’s in your computer. You might also want to check to see if any of your storage devices might be administratively disabled in the BIOS. If this is a brand new installation, check the hardware configuration. Check to see if you have both data and power cables connected to your hard drive. And you might want to try a different SATA interface on your motherboard to see if the problem follows the drive, or if it stays with a particular interface. And if you want to determine if the issue is with the motherboard or with the drive, you might want to disconnect that drive and take it to a known good computer for testing.
If you have a system configured with a RAID array then you have multiple drives that you have to troubleshoot. The first thing to check is to see if the RAID controller itself is operating properly. This is a separate piece of hardware that’s designed to provide RAID capabilities for your computer. If there are problems with the adapter card not being seated properly, or you’ve got a bad adapter card, you’ll need to first reseat or replace that card before looking at any problems with any drives.
Every RAID configuration is a little different in the way that it operates and the way that it provides information about the status of that RAID array. You want to go into the software that determines how the RAID array is performing to see exactly what the status might be. You might be able to tell very easily if the RAID is working properly or not, or you may be able to identify very quickly which exact disk might be having a problem in this entire array.
If you do have a bad drive then the process to resolve the problem is going to be different depending on what type of RAID you are using. If you’re using RAID 0 then you have two or more disks that are going to be used in that array, and any single drive failure is going to break the entire array, and you’re going to lose that data. The only way to recover from a RAID 0 failure is to replace the bad drive and restore everything from a backup. If you’re running RAID 1 you, of course, are running two or more disks, and the RAID will continue to work even if you have a drive go bad. That’s because, with RAID 1, your information is mirrored. To resolve the issue all you need to do is replace the bad drive, and the array will copy everything from the good drive onto that new drive and sync it up.
If you’re running RAID 5 then you’re using three or more disks as part of that RAID array. All drives need to be operational except for one– that means that if you do have a drive go bad, you simply pull out the bad drive and replace it, and the RAID array will rebuild itself. And if you’re running RAID 10, or RAID 1 plus 0, that means you’re using four or more disks as part of that RAID array. You could lose all but one from each set of mirrors. With RAID 1 plus 0, remember, we have drives that are mirrored and we are striping across all sets of those mirrors. That means that we could lose this drive, we could lose this drive, and we could lose this drive, and because there’s a mirror for each of those, your RAID array would still be up and running if you lost those three specific drives.
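As a rough way to check your understanding of those failure rules, here is a small Python sketch that answers the question “is the array still up after these drives fail?” The function name, the disk numbering, and the way RAID 10 mirrors are paired up (disk 0 with disk 1, disk 2 with disk 3, and so on) are all assumptions made for this example, not how any particular controller lays out its drives.

```python
from typing import Iterable


def raid_survives(level: str, total_disks: int, failed: Iterable[int]) -> bool:
    """Return True if the array keeps running after the given disks fail.

    Disks are numbered 0..total_disks-1. For RAID 10 this sketch assumes
    two-disk mirrors paired in order: (0, 1), (2, 3), ...
    """
    failed_set = set(failed)
    if any(d < 0 or d >= total_disks for d in failed_set):
        raise ValueError("failed disk index out of range")

    if level == "0":
        if total_disks < 2:
            raise ValueError("RAID 0 needs two or more disks")
        # Striping only: any single failure breaks the array
        return len(failed_set) == 0
    if level == "1":
        if total_disks < 2:
            raise ValueError("RAID 1 needs two or more disks")
        # Mirroring: fine as long as at least one copy is left
        return len(failed_set) < total_disks
    if level == "5":
        if total_disks < 3:
            raise ValueError("RAID 5 needs three or more disks")
        # All drives must be operational except for one
        return len(failed_set) <= 1
    if level == "10":
        if total_disks < 4 or total_disks % 2 != 0:
            raise ValueError("this sketch needs an even count of four or more disks")
        # Every mirror pair must keep at least one working drive
        pairs = [(i, i + 1) for i in range(0, total_disks, 2)]
        return all(not (a in failed_set and b in failed_set) for a, b in pairs)
    raise ValueError(f"unsupported RAID level: {level}")


# Six-disk RAID 10, losing one drive from each mirror: still running
print(raid_survives("10", 6, [0, 2, 4]))
# Four-disk RAID 10, losing both halves of one mirror: array is down
print(raid_survives("10", 4, [0, 1]))
```

The first call mirrors the three-drive example above: each lost drive still has a working partner, so the array stays up. The second shows that the same number of failures can be fatal if they land in the same mirror.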
Sometimes it’s your storage subsystem that is causing your operating system to either stop completely or run with extensive delays. You might see a Windows stop error– like this one here– or you might have the Apple spinning wait cursor on your screen. This could indicate that your hard drive is having problems reading or writing information, so you want to make sure you have a backup and that you can run diagnostics on that storage device.
One way to get some insight about how your drive is operating is by using SMART, or S-M-A-R-T. It stands for Self-Monitoring, Analysis, and Reporting Technology, and this is a function that’s built into a number of the drives that we use today. You can see an example of some of these SMART attributes– the raw read error rate, a spin up time, a start/stop count, a seek error rate, a load cycle count– and I can see all of the statistics that are associated with each one of those variables on my hard drive. This is a way that you can get ahead of the type of hardware failure where a device is slowly degrading over time. These SMART errors can give you a little bit of a heads up so you know what drive to replace before it fails completely. You might want to schedule some disk checks. This is built into most RAID arrays so that they can check all of the different SMART statistics and tell you where a problem might be occurring. If you find that there is a problem with some of these drives you can then remove that drive and replace it before it actually fails, and sometimes that is the difference between having all of the data intact on your RAID array or losing everything and needing to recover from a backup.
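To make the idea of a SMART “heads up” a little more concrete, here is a minimal Python sketch of the kind of check a monitoring tool might do. Drives typically report a normalized value and a failure threshold for each attribute, and an attribute is considered failed once its value drops to the threshold or below. The class name, the function name, the `margin` parameter, and all of the sample numbers are hypothetical and only for illustration; they are not readings from a real drive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SmartAttribute:
    name: str
    value: int      # normalized current value reported by the drive
    worst: int      # lowest normalized value seen so far
    threshold: int  # at or below this, the attribute counts as failed


def failing_attributes(attrs: List[SmartAttribute], margin: int = 0) -> List[str]:
    """Return names of attributes at or within `margin` of their threshold.

    Assumption in this sketch: a threshold of 0 marks an informational
    attribute that never triggers a failure, so it is skipped.
    """
    return [
        a.name
        for a in attrs
        if a.threshold > 0 and a.value <= a.threshold + margin
    ]


# Hypothetical sample readings, not from a real drive
sample = [
    SmartAttribute("Raw Read Error Rate", 100, 100, 51),
    SmartAttribute("Spin Up Time", 30, 30, 25),
    SmartAttribute("Start/Stop Count", 99, 99, 0),
    SmartAttribute("Seek Error Rate", 28, 28, 30),
    SmartAttribute("Load Cycle Count", 95, 95, 0),
]

print(failing_attributes(sample))             # already past its threshold
print(failing_attributes(sample, margin=10))  # also those getting close
```

Using a nonzero `margin` is one way to get the early warning described above: the drive is flagged while it is still working, so you can replace it before it fails completely.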