Is This What a Defective Hard Drive Behaves Like? (The Case Of The Random Workstation Hangs)
Posted by William Diaz on August 3, 2011
The nature of this problem made it difficult or impossible for the help desk to identify because there was nothing to look at that that would tell the technician what was happening when these calls started coming in. They started as a complaint of general system slowness at random times through out the day and were often being assigned to be looked at overnight, which resulted in zero findings because no one knew what they were looking for and could not experience the issue remotely. And if they did, in fact run across the issue while logged on, they could not do anything anyway because the issue of the stalled workstation appeared as a remote connectivity problem and not necessarily a local hardware issue with the workstation.
As I started to here about these issues, I became interested and kept an ear out for a user or two who was encountering the random hang. Identifying a workstation with the problem actually became rather simple because during the hang, a very specific series of events would kick off after the system resumed from the hung state. Isolating the cause, though, was a lot more involved. That’s because the nature of these issues is often software based, e.g. a system or application process was kicking off, or some low level driver was locking up the system. To assist me in that task of finding the culprit, I used a few tools, starting with the Windows XP Event Viewer, then moving to Process Monitor to collect process trace logs, WinDbg to examine manual crash dumps of the hanging system, Performance Monitor, and finally installing Windows 7 after all else failed to take advantage of its enhanced Event Tracing.
Some background. The workstation hangs for the most part coincided with the then recent deployment of new Dell Optiplex 960 and 980 workstations. The hangs were not “hard hangs”, a type of hang where the system becomes completely unresponsive and needs to be manually rebooted. The hangs being seen could be characterized as “soft” in that the workstation would eventually recover after a certain amount of time, usually between 2-5 minutes. During the hang, the mouse was still active but switching between applications was not possible and all keystrokes or commands became queued during the hang. Once the system recovered, any pending operations were executed immediately afterwards. There was no rhyme or reason to the hangs, they were entirely random and would happen several times a day while any user was logged on.
I connected to the workstations after hours and examined the event logs for anything out of the ordinary. Normally, I am looking for error’s or warnings, and I was specifically focused on the System logs, hoping to see disk warnings indicating there were bad blocks on the hard drive. Not seeing anything there, I turned to the application logs but didn’t see anything that stood out there either. Looking at the other workstation, too, did not reveal anything telling.
With nothing to go on, I turned to the generic Information events and noticed that after each reported instance of hang there were a slew of McLogEvent 257 events:
Although we began to focus on the anti-virus initially as a suspect, in hindsight the anti-virus was actually a victim. These events were basically telling us that local system files could not be scanned because the system was too busy, or too backed up, to process any file disk IO. After 90 seconds, the anti-virus would give up and suspend the scan operation of the local files it was scanning in the background and through this event. The obvious point of contention was the hard disk or disk subsystem.
Thinking that maybe some and large amount of disk write and reads were responsible, I left process monitor running on one of the workstations that was having the issue, routinely clearing the trace until the issue manifested itself. After saving the log, I opened it but only saw a small number of read/writes for all processes, including system, not nearly enough to account for the a hang of several minutes.
Eventually, I turned to Performance Monitor. The major benefit here was that I could remotely connect perfmon to the problem workstation and monitor various system counters in realtime. A task we could not do on the system itself since it was hung and could not display any performance data. I added a few key counters: CPU usage, Available Memory, and various Disk counters: %Disk Time, %Disk Read Time, %Disk Write time, and Current Disk Queue Length. Below is a sample of these counters when the system would hang.
Disk read and write time also spiking to 100% with Processor usage as the lower red line to show no excessive CPU usage:
Here is the most telling result, though, the Current Disk Queue Length is huge. Think of this as a traffic jam of file disk IO waiting to be processed. Because the disk is “off to the races”, the system is stuck waiting for the disk to return to normal and process this backlog of pending disk IO:
To sum it up, the hard drives on these systems just started to spin up for no known reason for any random number of minutes, and this was beginning to look more and more like hardware issue. This perfmon trace and several others were captured when the workstation was not under any significant load (or any load for that matter).
After receiving a couple of these (I actually inherited one and took on the other to test a proof of concept) I ran a series of disk stress tests and surface scans but nothing turned up, and with Windows XP unable to report any performance issues, I decided to install Windows 7 in hopes that its much better and more robust set of system tools could detect the problem. After installation completed (not an image but a “vanilla” of the box installation), I started to setup the environment by installing a few apps, and, in no time, the workstation encountered its signature hang. Again, I turned to the vent viewer and, whereas XP did not log an event, Windows 7 did: “Windows: A request to write to the file “file path…” at offset … bytes exceeded, but took an abnormally long time (# seconds) to be serviced by the OS. This problem is likely due to faulty hardware…”
Reported as ESENT ( Extensible Storage Engine) source 508, this is a clear indication of a hard disk, disk controller, or some other storage component problem. You can see the system “took an abnormally long time (178 seconds) to be serviced by the OS”, which coincided precisely with the hang time in this instance. To narrow down the issue to the disk or disk controller, I took a spare hard drive of a different model and make, imaged it, and ran on the problem workstation for several days, with no issues. Then, we contacted Dell and asked them to send us an identical drive and the new drive performed with no issues.
With this, I was certain that the issue was with the hard disk itself. In all cases, the problem drives were WD VelociRaptor drives of the same model, and my conclusion was that the soft hangs were the result of a bad batch of hard drives drives shipped with the systems we started receiving. In all the cases where the drive was replaced, the issue did not return.