Many of us have experienced a hard disk failure; some of us have even tried to find out more about the reliability of hard disks and the deeply hidden prediction function that is part of a technology called S.M.A.R.T. One might argue that S.M.A.R.T. is not very reliable, as it does not predict failure in all cases. This is partly true, but the actual inner workings of this self-monitoring system are not so simple. So let us try to remedy this uncertain situation and examine how S.M.A.R.T. really works. Let's get started:
S.M.A.R.T. - A Bit Of History
S.M.A.R.T., sometimes written as SMART (Self-Monitoring, Analysis and Reporting Technology), is a system that monitors the internal information of your drive. Its development began in 1992 and it is now included in all modern IDE and Serial ATA hard disk drives (HDDs). It is also present in solid state drives (SSDs).
Its history covers an array of names, like Predictive Failure Analysis or IntelliSafe, and input from all the major hard disk manufacturers: IBM, Seagate, Quantum, Western Digital. Finally, its documentation was featured for the first time in 2004 within the Parallel ATA standard and received regular revisions until 2008.
It was created out of the need to monitor a disk's status, and it will supposedly tell you if the hard disk is about to, well, die. While you might think that S.M.A.R.T. magically guesses whether the drive is healthy, it actually exposes a series of variables, whose number and type vary from drive to drive, that are indicators of its reliability. There are about 50 of these attributes (for example, raw read error rate, spin up time, reported uncorrectable errors, power on time, load cycle count, etc.); the full list can be accessed here.
Apart from some singular attempts (Google, Backblaze), most of the S.M.A.R.T. data is undocumented. The system provides a great deal of internal data, but there are a lot of inconsistencies in the statistics, mainly because many of the hard drive manufacturers use different definitions and measurements. For example, some manufacturers store power on times as hours, while others measure in minutes or seconds. Also, none of them tries to explain which of the various attributes or variables are worth our attention, leaving us to drown in data.
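To illustrate the unit problem, here is a minimal Python sketch that converts a raw power on time counter to hours. The function name `power_on_hours` and the `unit` argument are hypothetical; S.M.A.R.T. itself does not tell you which unit a drive uses, so the caller has to know it for the model in question.

```python
# Seconds per counting unit. Which unit a given drive uses is an
# assumption the caller must supply; it is not reported by the drive.
SECONDS_PER_UNIT = {"hours": 3600, "minutes": 60, "seconds": 1}


def power_on_hours(raw_value: int, unit: str) -> float:
    """Convert a raw power on time counter to hours."""
    if unit not in SECONDS_PER_UNIT:
        raise ValueError(f"unknown unit: {unit!r}")
    return raw_value * SECONDS_PER_UNIT[unit] / 3600


# The same year of uptime, as three different drives might store it.
print(power_on_hours(8760, "hours"))
print(power_on_hours(525600, "minutes"))
print(power_on_hours(31536000, "seconds"))
```

All three calls describe the same amount of uptime, which is why comparing raw numbers across manufacturers without knowing the unit is misleading.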
Before attempting to understand which of the attributes are relevant, we first have to differentiate between the two main types of failures: predictable and non-predictable. The first type includes the failures that develop gradually and that are caused by faulty disk mechanics or damage to the disk's surface. These problems get worse over time and the disk will eventually fail. Non-predictable failures are caused by sudden events, such as power surges.
NOTE: It is important to understand that S.M.A.R.T. can only help you detect predictable failures.
Important S.M.A.R.T. Attributes
The health status of the hard disk is constantly monitored by multiple sensors. The manufacturer's algorithms process the measured values and then update the corresponding attributes according to the results.
In any monitoring program you use, every S.M.A.R.T. attribute will have these fields:
- Identifier: the definition of the attribute. It usually has a standard meaning and is marked with a number between 1 and 250 (for example, 9 is Power-On Hours). Still, all disk monitoring and testing tools provide the name and a textual description of the attribute.
- Threshold: the minimum acceptable value for the attribute. If the attribute's value drops to this level, your hard disk is about to fail.
- Value: current value of the attribute. The algorithm calculates this number based upon the raw data. A new hard disk will have a high number, the theoretical maximum (100, 200 or 253 depending on the manufacturer), that will decrease during the lifetime of the hard drive.
- Worst: the smallest value of the attribute ever recorded.
- Data: raw measured values provided by a sensor or a counter. This is the data used by the algorithm designed by the manufacturer of the HDD. Its contents depend on the attribute and the maker of the hard disk. Regular users should skip this one.
- Flags: the purpose of the attribute. This is usually set by the manufacturer and therefore will vary from disk to disk. Each attribute is either critical, meaning it can predict an imminent failure (for example ID 5, reallocated sectors count), or statistical, with no direct effect on status (for example ID 174, unexpected power loss count).
NOTE: Always remember: when trying to understand the status of any S.M.A.R.T. attribute, check the values of these three fields: value, threshold and flags. Also remember that smaller values are an indication of a decrease in reliability.
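The fields listed above, and the rule of checking value, threshold and flags together, can be sketched in Python roughly as follows. This is only an illustration: the class `SmartAttribute`, the function `assess`, the status strings and the sample numbers are all hypothetical, and the `critical` boolean is a simplified stand-in for the real flags field, whose layout depends on the manufacturer.

```python
from dataclasses import dataclass


@dataclass
class SmartAttribute:
    identifier: int   # attribute ID, e.g. 5
    name: str         # textual description shown by monitoring tools
    value: int        # current value calculated from the raw data
    worst: int        # smallest value ever recorded
    threshold: int    # minimum acceptable value
    raw: int          # raw data from a sensor or counter
    critical: bool    # simplified stand-in for the flags field


def assess(attr: SmartAttribute) -> str:
    """Compare the current value with the threshold, then consult the flag."""
    if attr.value <= attr.threshold:
        # Only critical attributes are treated as predicting a failure.
        return "FAILING" if attr.critical else "ADVISORY"
    if attr.worst <= attr.threshold:
        # The value recovered, but it did reach the threshold at some point.
        return "REACHED_THRESHOLD_BEFORE"
    return "OK"


# Made-up sample values for illustration only.
healthy = SmartAttribute(5, "Reallocated Sectors Count", 100, 100, 36, 0, True)
worn = SmartAttribute(5, "Reallocated Sectors Count", 30, 30, 36, 1200, True)

print(assess(healthy))  # OK
print(assess(worn))     # FAILING
```

Note that the raw field plays no part in this comparison; the decision is made purely on the calculated value, which is why regular users can usually skip the raw data for most attributes.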
Disk temperature is a highly debated parameter. Still, it is generally considered that values above 60°C will reduce the lifespan of an HDD and increase the probability of damage. We recommend using a fan to decrease the temperature of your hard disks and maybe prolong their life.
As we mentioned above, not all S.M.A.R.T. attributes are critical for failure prediction. The two studies on hard drive failure rates mentioned above (Google, Backblaze) and other sources agree that ID 5, Reallocated Sectors Count, is an important help in identifying failing drives. Reallocation happens when the drive's logic remaps a damaged sector, as a result of recurring soft or hard errors, to a new physical sector taken from its spare ones. This attribute reflects the number of times a remapping has happened and it is an indication of the HDD's surface wear.
Another useful attribute to monitor is ID 197, Current Pending Sector Count. This counts the "unstable" sectors, meaning the damaged ones with read errors that are awaiting remapping, a kind of "probation" system. Opinions on this particular attribute are mixed, as it is sometimes unconvincing, but it is said to provide an earlier warning of possible problems.
The last of these indicators that must be monitored is ID 187, Reported Uncorrectable Errors. It counts the errors that could not be recovered, and it is useful because it seems to have the same meaning for all manufacturers.
NOTE: All of the above-mentioned S.M.A.R.T. attributes have a value that is easy to interpret. If they report a number of 1 or more, it is very possible that your drive is failing, so you had better start backing up. Still, even though these are useful indicators of drive reliability, they are not foolproof and you should compare them to the advice given by your chosen HDD monitoring tool.
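The "1 or more" rule for these three attributes can be expressed as a short Python sketch. It assumes you have already read the raw counts with a monitoring tool and placed them in a dictionary keyed by attribute ID; the names `WATCHED` and `warning_signs` and the sample readings are hypothetical.

```python
from typing import Dict, List

# The three attributes discussed above, keyed by their ID.
WATCHED = {
    5: "Reallocated Sectors Count",
    197: "Current Pending Sector Count",
    187: "Reported Uncorrectable Errors",
}


def warning_signs(raw_counts: Dict[int, int]) -> List[str]:
    """Return a message for every watched attribute whose raw count is 1 or more."""
    return [
        f"ID {attr_id}: {name} = {raw_counts[attr_id]}"
        for attr_id, name in WATCHED.items()
        if raw_counts.get(attr_id, 0) >= 1  # attributes the drive lacks count as 0
    ]


# Made-up readings: ID 9 is ignored because it is not in WATCHED.
readings = {5: 0, 9: 12000, 197: 8, 187: 2}
for message in warning_signs(readings):
    print(message)
# ID 197: Current Pending Sector Count = 8
# ID 187: Reported Uncorrectable Errors = 2
```

An empty result does not prove the drive is healthy; as noted above, these indicators are not foolproof, so treat the output as a prompt to back up rather than a verdict.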
This was our short study of the inner workings of S.M.A.R.T. and its ability to monitor and predict hard disk failures. The main point you should remember is that this self-monitoring system will help you review the status of your HDD. If you want to use this S.M.A.R.T. data to see if your own drive has problems, read the articles recommended below.