Saturday, February 28, 2026

Ironically SMART

A decade ago, an evil mastermind 1 tricked me into writing an application to display the SMART data from disk drives. The application came together relatively quickly and provides features of more mature products by using native FreeBSD functionality. The GEOM framework allows specifying the drive name without the full device path (e.g., ada0). The CAM library provides a unified way to pass arbitrary commands to ATA, SCSI, and NVMe drives. The XO library provides application output as text, XML, JSON, or HTML. One of the goals was to make parsing the output easy: for example, tab-delimited columns for shell scripts and JSON documents for languages with JSON support. XO makes adding this to an application trivial. The library is easy to use and powerful. But with great power comes great responsibility, which is why I should not have been allowed to use it without adult supervision, as a recent issue demonstrated. To be clear, the issue was with my understanding of SMART and not libxo.

libxo JSON contains unescaped tab characters causing invalidity #10

When running with --libxo json output, the JSON is invalid due to unescaped
tab characters in the threshold field:

"threshold":"50	100	100	0"

To understand why this happened, let's dive into what SMART data actually is. Self-Monitoring, Analysis, and Reporting Technology or "SMART" is data disk drives provide to gauge their health and reliability. The term originated with ATA drives, but most disk protocols provide some variant of this functionality. I applaud the ATA folks for giving system administrators a fighting chance in the drive failures war. They should get side-eye for specifying the format returned by the drive as

Table 35 from ATA/ATAPI-5 specification

This raises the question: how does an application like smartctl display output for drives if the data is vendor specific? First, the actual level of anarchy among drive vendors is smaller than the table above suggests 2. In practice, many of the attribute IDs as well as their data representation (i.e., which bytes and in what order) are standard for a particular vendor (e.g., attribute ID 1 is the same for all Seagate drives) and occasionally also between vendors. While this helps stem the chaos, the main mechanism in smartmontools is an internal database of drive models that maps the attribute ID values to names and describes each attribute's data representation. This information presumably comes from the published disk drive specifications.

To understand the data format better, it is useful to look at how smartctl represents each attribute internally:

/* ata_smart_attribute is the vendor specific in SFF-8035 spec */
#pragma pack(1)
struct ata_smart_attribute {
    unsigned char id;
    unsigned short flags;
    unsigned char current;
    unsigned char worst;
    unsigned char raw[6];
    unsigned char reserv;
} ATTR_PACKED;
#pragma pack()

This is the C/C++ structure the program uses to represent each of the up to 30 attributes in the ATA SMART Read Data command response. Compilers are free to insert padding between structure members to align them, typically for performance reasons. The variants of "pack" in this structure tell compilers to lay out the structure exactly as written, with no padding. Seeing as this represents data returned by hardware, a "packed" data structure makes sense.
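
To make the effect of packing concrete, here is a minimal sketch of my own, not the smartmontools source. The names smart_attr and smart_attr_get are hypothetical, __attribute__((packed)) is the GCC/Clang spelling, and the 2-byte offset of the attribute table (after the revision number) in the 512-byte response is an assumption about the layout. Note that flags is read in host byte order here.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */
#include <stdint.h>
#include <string.h>   /* memcpy */

/* Hypothetical stand-in for smartctl's structure (assumed names). */
struct smart_attr {
    uint8_t  id;
    uint16_t flags;
    uint8_t  current;
    uint8_t  worst;
    uint8_t  raw[6];
    uint8_t  reserv;
} __attribute__((packed));

/* With packing, the members add up to exactly 12 bytes and flags
 * starts right after id, with no padding byte in between. */
static_assert(sizeof(struct smart_attr) == 12, "entry must be 12 bytes");
static_assert(offsetof(struct smart_attr, flags) == 1, "no padding after id");

enum {
    SMART_ATTR_COUNT  = 30, /* up to 30 entries per response */
    SMART_ATTR_OFFSET = 2   /* assumed: table follows a 2-byte revision */
};

/* Copy entry `index` out of a raw response buffer.
 * Returns 0 on success, -1 if index is out of range. */
static int smart_attr_get(const uint8_t *sector, size_t index,
                          struct smart_attr *out)
{
    if (index >= SMART_ATTR_COUNT)
        return -1;
    memcpy(out, sector + SMART_ATTR_OFFSET + index * sizeof(*out),
           sizeof(*out));
    return 0;
}
```

Without the packed attribute, a typical compiler would place a padding byte after id so that flags is 2-byte aligned, and the structure would no longer line up with the bytes the drive returns.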

The bug mentions the "threshold" field, but threshold doesn't appear in the specification or C structure. This stems from a misunderstanding on my part of SMART and smartctl. The output from smartctl -A includes a description and column headers:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID#  ATTRIBUTE_NAME  FLAG  VALUE  WORST  THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE

The primary goal for my application was to output the "raw" value of each attribute. The FLAG, VALUE, and WORST columns correlate to the flags, current, and worst fields in smartctl's ATA SMART attribute structure. At the time, my brain equated these to the "with Thresholds" description and output them when the user adds the --threshold option. What is the fourth value my application prints? That is the contents of the "reserv" field.

As the ATA specification punted on defining these, so did my application, opting instead to print the four free-floating values. The unlabeled values are the root cause of the reported issue. Thus, the fix is straightforward: open a new XO container named "threshold" and print the (status) "flags", "nominal" (a.k.a., "current"), and "worst" values with their associated key names. This makes jq happy.
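
To show the shape of the change without depending on libxo, here is a rough sketch that uses plain snprintf instead. In the real application the container and keys come from libxo (its xo_open_container, xo_emit, and xo_close_container calls); the function name format_threshold and the exact output layout below are my assumptions for illustration only.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format the three labeled values as a JSON
 * member named "threshold". Each value gets its own key, so no tab
 * characters are needed to separate them.
 * Returns the number of characters written, as snprintf does. */
static int format_threshold(char *buf, size_t len, uint16_t flags,
                            uint8_t nominal, uint8_t worst)
{
    return snprintf(buf, len,
        "\"threshold\":{\"flags\":%u,\"nominal\":%u,\"worst\":%u}",
        (unsigned)flags, (unsigned)nominal, (unsigned)worst);
}
```

For the values in the bug report, this would produce "threshold":{"flags":50,"nominal":100,"worst":100} where the old output had a single string with embedded tabs.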

What happened to the fourth value, "reserv"? Originally I assumed it was a reserved byte, and in the spirit of "show me the data", the application printed it in the threshold section for ... reasons. But looking at various drive specifications, it appears that the "raw" value can be up to 7 bytes. Many (most? some?) attribute values are 6 bytes, but some attributes do use all 7 bytes. Now, instead of displaying the "reserv" byte with the threshold data, it is displayed as part of the raw data.
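
As a sketch of what "part of the raw data" can mean, the helper below folds "reserv" into the raw value as a seventh byte. The function name smart_raw7 is hypothetical, and the byte order is an assumption: raw[0] is treated as the least significant byte and "reserv" as the most significant. Individual attributes may use a different order.

```c
#include <stdint.h>

/* Hypothetical helper: combine raw[0..5] and reserv into one 56-bit
 * value. Assumes raw[0] is the least significant byte and reserv is
 * the most significant byte. */
static uint64_t smart_raw7(const uint8_t raw[6], uint8_t reserv)
{
    uint64_t value = reserv;

    /* Walk from the most significant raw byte down to raw[0]. */
    for (int i = 5; i >= 0; i--)
        value = (value << 8) | raw[i];
    return value;
}
```

When "reserv" is zero, the result is the same as the 6-byte value, so attributes that only use 6 bytes are unaffected under this assumption.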

 



1 They are a lovely individual whom I cherish. That said, I cannot overlook their mustache twirling.
2 Levels of anarchy in the UK are higher if you never mind the bollocks.
