When devices fail in the field, the problems can be numerous. In conversations with the embedded OEMs we work with, one issue affects almost every manufacturer: the cost of diagnosing and fixing the causes of field failure. It delays time-to-market and pulls resources away from development and into field diagnostics and post-mortem analysis. It is therefore vital to understand correctly what is causing a device to fail.
Pinpointing the causes of field failure
When delving into the causes, many of the core issues can be traced to overlooking the importance of the file system. Problems such as critical data loss from power outages and prematurely worn-out flash can be tackled effectively with an optimized file system.
The cost in time and resources that results from device failure is especially significant for the following reasons:
- The need for defect prevention during field operation: The high degree of reliability required to protect critical data dictates that devices must not fail. Manufacturers must run extensive tests across a range of user scenarios to guard against edge cases. Analyzing the test results can be a daunting task because of the many interfaces between the hardware, software, and application layers. These interactions therefore need to be tracked continuously, so that any deviation can be quickly discovered and corrected.
- Vulnerability of the device to wear-related failures: As flash media continues to increase in density and complexity, it is also becoming more vulnerable to wear-related failures. Shrinking lithography brings increased ECC requirements and a move to more bits per cell. With this comes the concern that what was written to the medium may not in fact be what is read back. Yet most applications assume that the data written to the file system will be completely accurate when read back. If the application does not fully validate the data it reads, errors in that data can cause the application to fail, hang, or simply misbehave. Preventing device failures due to data corruption therefore requires checks that validate the data read against the data written (a minimal sketch of one such check follows this list).
- Complexity of hardware and software integration: The complex integration of hardware and software within embedded devices makes finding the cause of failures a painstaking job, one that requires coordination between several hardware and software vendors. For this reason, it often takes OEMs days to investigate causes at the file system layer alone, and problems below that layer can demand more extensive testing involving multiple vendors. Log messages can help manufacturers pinpoint the location of a failure so that the right vendor can be notified.
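To make the read-back validation point concrete, here is a minimal C sketch, assuming a simple length-plus-CRC record layout of my own devising (not any particular vendor's API): every record is written with a CRC-32 of its payload, and the CRC is recomputed and compared on read.

```c
/*
 * Minimal sketch of read-back validation. The record layout, function
 * names, and use of stdio are illustrative assumptions, not any
 * vendor's API; a real on-flash format would also pin down endianness
 * and record framing.
 */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Plain bitwise CRC-32 (IEEE polynomial, reflected form). */
static uint32_t crc32(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return ~crc;
}

/* Store the payload length and CRC ahead of the payload itself. */
static int write_record(FILE *fp, const void *data, uint32_t len)
{
    uint32_t crc = crc32(data, len);
    if (fwrite(&len, sizeof len, 1, fp) != 1) return -1;
    if (fwrite(&crc, sizeof crc, 1, fp) != 1) return -1;
    if (fwrite(data, 1, len, fp) != len) return -1;
    return 0;
}

/* Recompute the CRC on read; a mismatch means the medium returned
 * something other than what was written, so fail loudly instead of
 * letting the application consume corrupt data. */
static int read_record(FILE *fp, void *data, uint32_t maxlen)
{
    uint32_t len, stored;
    if (fread(&len, sizeof len, 1, fp) != 1 || len > maxlen) return -1;
    if (fread(&stored, sizeof stored, 1, fp) != 1) return -1;
    if (fread(data, 1, len, fp) != len) return -1;

    uint32_t actual = crc32(data, len);
    if (actual != stored) {
        /* Log enough context to pinpoint which layer to investigate. */
        fprintf(stderr, "CRC mismatch: stored %08" PRIX32
                        ", computed %08" PRIX32 "\n", stored, actual);
        return -1;
    }
    return (int)len;
}
```

A mismatch does not tell you which layer corrupted the data, but it converts a silent failure into a logged, diagnosable one, which is exactly what shortens a post-mortem.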
As each of these three points indicates, identifying failures more quickly is the key to reducing their cost. Why your embedded design is failing could come down to any or all of the above issues.
The ability to pinpoint the cause of failure is especially helpful when an OEM is:
- Troubleshooting during the manufacturing and testing process to make sure that their devices do not fail for the given user scenarios.
- Doing post-mortem analysis on parts returned from their customers, in order to understand the reasons for failures, and possible solutions.
- Required to maintain a log of interactions between the various parts of the device, to assist with future failure prevention or optimization (a minimal logging sketch follows this list).
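As a sketch of that last point, the macro below timestamps each message and tags it with the subsystem it came from, so interactions across layers can be reconstructed later. The macro name, tags, and the use of stderr are my own illustrative assumptions; an embedded build would typically route this to a persistent, wear-aware log area instead.

```c
/*
 * Minimal structured-logging sketch. Names and output target are
 * illustrative assumptions, not any particular product's API.
 */
#include <stdio.h>
#include <time.h>

#define DEV_LOG(subsys, ...)                                           \
    do {                                                               \
        fprintf(stderr, "[%lld] %s: ", (long long)time(NULL), subsys); \
        fprintf(stderr, __VA_ARGS__);                                  \
        fputc('\n', stderr);                                           \
    } while (0)

int main(void)
{
    /* Hypothetical events at different layers of the stack. */
    DEV_LOG("flash", "ECC corrected %d bit(s) in block %u", 1, 42u);
    DEV_LOG("fs", "mount completed, last transaction point intact");
    return 0;
}
```

Consistent tags like these let an OEM tell at a glance whether a returned unit failed in the driver, the file system, or the application above it.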
Identifying the causes and costs of field failure is one thing, but what specific solutions can OEMs turn to in order to prevent these issues in the first place?
Fighting field failure with transactional file systems
As we mentioned, various file system solutions exist for safeguarding critical data. Basic FAT remains a simple and robust option with decent performance. Unfortunately, it cannot provide the degree of data integrity sometimes needed in safety-critical industries like automotive, aerospace, and industrial.
It bears repeating that embedded device fail-safety can be achieved with the right file system. Transactional file systems like Tuxera’s own Reliance Edge™ and Reliance Nitro™ offer carefully engineered levels of reliability, control, and performance for data that is simply too vital to be lost or corrupted. A key feature of a high-quality transactional file system is that it never overwrites live data, ensuring that the previous version of that data remains safe and sound. This preserves user data in the event of power loss.
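As a user-space illustration of the same never-overwrite principle (not how Reliance Edge or Reliance Nitro work internally, just the classic POSIX analogue), the sketch below writes a new version of a file under a temporary name, flushes it, and then swaps it into place atomically, so the old version survives a power cut at any point. The file names and function name are illustrative.

```c
/*
 * Sketch: power-safe replacement of a small file on a POSIX system.
 * The new contents never touch the live file; rename() swaps them in
 * atomically, so after a power cut you find either the complete old
 * version or the complete new one, never a torn mix.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int save_file_atomically(const char *path, const char *tmp_path,
                                const void *data, size_t len)
{
    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Write and flush the new version before it becomes visible. */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    /* Atomic on POSIX: readers see the old file or the new one. */
    if (rename(tmp_path, path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

Note that this pattern is only as power-safe as the underlying file system’s rename guarantee; on a plain FAT implementation an interrupted rename can still leave metadata corrupted, which is precisely the gap a transactional file system closes by applying the same never-overwrite discipline to its own metadata.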
In the video below, I demonstrate the difference in data preservation between a FAT driver and Reliance Edge. The embedded device in the video suffers a power outage that causes severe image corruption, leaving the device nearly useless. A scrambled screen is just one example of what device failure looks like in the field. Now imagine that the corrupted image is your system or BIOS update, and that the device is storing data critical to your use case: a disastrous situation.
Final thoughts
Embedded device failure can cost the manufacturer significant resources and cause time-to-market delays. The first step in finding and correctly identifying the causes of those failures is understanding the ways a device can fail, such as through flash wear, hardware and software integration problems, or gaps in testing.
The next step in tackling embedded device failures is understanding the role of a file system in securing your critical data, specifically against field failures and power loss. Selecting a quality-assured transactional file system is an effective way of doing that.
* This blog post was originally published in June 2020 and updated in November 2021.
Embedded device manufacturers – find out how file systems can help bulletproof your critical data.