Fault Management

Introduction

The Fault Management subsystem for the Safety Island provides a mechanism for capturing, reporting and collating faults from supported hardware in safety-critical designs.

The subsystem interfaces with the following types of devices:

  • A fault device, which reports faults from its safety mechanisms. It may also report faults originating from other fault devices to support the creation of a fault device tree.

  • A safety state device, which manages a state in reaction to reported faults.

Supporting driver implementations are provided for the following Arm hardware designs:

  • A Device Fault Management Unit (Device FMU): a fault device attached to a GIC-720AE interrupt controller.

  • A System Fault Management Unit (System FMU): a fault device which collates faults from upstream FMUs.

  • A Safety Status Unit (SSU): a safety state device which manages a state machine in response to faults in a safety-critical system.

Faults

A unique fault (i.e. generated by a specific safety mechanism and reported by a fault device implementation) is represented by the subsystem and driver interfaces as a device-specific 32-bit integer along with a handle to the originating device.

A fault may be critical or non-critical and this affects how it is processed by the subsystem.
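As an illustration, such a fault could be modelled in C as follows. This is a hypothetical sketch, not the actual driver types: the struct, field names, and helper are invented for explanation only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: a unique fault is a device-specific 32-bit ID
 * together with a handle to the fault device that reported it. */
struct fault_device;                  /* opaque driver handle */

struct fault_record {
    const struct fault_device *dev;   /* originating fault device */
    uint32_t id;                      /* device-specific fault ID */
    bool critical;                    /* affects how it is processed */
};

/* Two records describe the same unique fault only when both the
 * originating device and the 32-bit ID match; criticality is an
 * attribute of the fault, not part of its identity. */
static bool fault_record_same(const struct fault_record *a,
                              const struct fault_record *b)
{
    return a->dev == b->dev && a->id == b->id;
}
```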

Fault Device Trees

The subsystem is configured with a list of “root” fault devices - those located at the root of a fault device tree. Root fault devices are typically collators of faults from multiple upstream fault devices (possibly recursively) and may also directly affect the state of a connected safety state device.

The diagram below shows an illustrative fault device tree. (For the simpler Kronos topology, see Kronos Deployment below.)

A Sample Fault Device Tree

Safety States

The SSU state machine has 4 safety states:

  • TEST: Self-test

  • SAFE: Safe operation

  • ERRN: Non-critical fault detected

  • ERRC: Critical fault detected

Control signals from software:

  • compl_ok: Diagnostics complete or non-critical fault cleared

  • nce_ok: Non-critical fault diagnosed

  • ce_not_ok: Critical fault diagnosed

  • nce_not_ok: Non-correctable non-critical fault

Control signals connected in hardware to the root fault device:

  • nc_error: Non-critical error

  • c_error: Critical error

  • reset

TEST is the initial state on boot. The software is responsible for transitioning to SAFE after the successful completion of a self-test routine. ERRC represents a critical system failure, which can only be recovered by resetting the system. A non-critical fault causes a transition to ERRN, which can either be recovered back to SAFE or promoted to ERRC by the software.

The diagram below shows all the possible transitions between these states using these signals.

SSU States

Finite State Machine (FSM) States and Transitions:

  • From reset, the FSM defaults to the TEST state.

  • It remains in this state until the software has completed any power-up tests. If the software-controlled tests pass, a write can be issued to move the FSM to the SAFE state.

  • If the tests fail, a write can be issued to move the FSM to the ERRN state, indicating that an error has occurred that may be resolvable.

  • After further diagnosis, the software issues a write depending on whether the error was determined to be resolved, moving the FSM to SAFE if it was resolved or to ERRC if it was not.

  • When in the SAFE state, the FSM can only be moved by one of the following:

    • a reset, moving it back to TEST

    • a non-critical error interrupt, moving it to ERRN

    • a critical error interrupt, moving it to ERRC

  • If a critical and a non-critical error occur at the same time, the critical error takes precedence and the FSM moves to ERRC.
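The states and transitions above can be modelled as a small C state machine. This is an illustrative sketch only: the state and signal names follow the text, but the mapping of individual signals to transitions is our reading of the description, not the SSU programmers' model.

```c
#include <assert.h>

enum ssu_state { SSU_TEST, SSU_SAFE, SSU_ERRN, SSU_ERRC };

enum ssu_signal {
    /* software-issued control signals */
    SSU_COMPL_OK,    /* diagnostics complete or non-critical fault cleared */
    SSU_NCE_OK,      /* non-critical fault diagnosed */
    SSU_CE_NOT_OK,   /* critical fault diagnosed */
    SSU_NCE_NOT_OK,  /* non-correctable non-critical fault */
    /* signals connected in hardware to the root fault device */
    SSU_NC_ERROR,    /* non-critical error */
    SSU_C_ERROR,     /* critical error */
    SSU_RESET,
};

static enum ssu_state ssu_next(enum ssu_state s, enum ssu_signal sig)
{
    if (sig == SSU_RESET)
        return SSU_TEST;        /* reset always returns to self-test */
    if (sig == SSU_C_ERROR || sig == SSU_CE_NOT_OK)
        return SSU_ERRC;        /* critical faults take precedence */

    switch (s) {
    case SSU_TEST:
        if (sig == SSU_COMPL_OK) return SSU_SAFE;  /* self-test passed */
        if (sig == SSU_NCE_OK)   return SSU_ERRN;  /* failed, maybe resolvable */
        break;
    case SSU_SAFE:
        if (sig == SSU_NC_ERROR) return SSU_ERRN;  /* non-critical fault */
        break;
    case SSU_ERRN:
        if (sig == SSU_COMPL_OK)   return SSU_SAFE;  /* fault cleared */
        if (sig == SSU_NCE_NOT_OK) return SSU_ERRC;  /* promote to critical */
        break;
    case SSU_ERRC:
        break;                  /* only reset leaves ERRC */
    }
    return s;                   /* ignore other signals in this state */
}
```

Note how ERRC is absorbing with respect to everything except reset, matching the statement that a critical failure can only be recovered by resetting the system.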

Design

The implementation and functionality of the Fault Management subsystem for the Safety Island are grounded in the Zephyr real-time operating system (RTOS) environment.

Drivers

Driver interfaces are provided for fault devices and safety state devices. Specific driver implementations with devicetree bindings are provided for the Arm FMU and Arm SSU.

The public driver interfaces are described under components/safety_island/zephyr/src/include/zephyr/drivers/fault_mgmt

The drivers are instantiated in the devicetree using bindings under components/safety_island/zephyr/src/dts/bindings/fault_mgmt

Fault Management Unit

The FMU driver is an implementation of a fault device. Inside the driver, one of two driver implementations is selected at runtime to handle differences between the GIC-720AE and the System FMU programmers’ views.

It is expected that interrupts are only defined for root FMUs. If the root FMU is a System FMU, it will collate faults from multiple upstream sources. The driver in this case will inspect the status of other FMUs in the tree when a fault occurs to determine the exact origin and cause of the fault.

The FMU driver allows a single callback to be registered, through which incoming faults are reported.

Safety Status Unit

The SSU driver is an implementation of a safety state device. It implements the safety state device interface which allows its state to be read and controlled.

Subsystem

The Fault Management subsystem manages two fault-handling threads (one for critical faults and another for non-critical faults), which listen for queued faults from any configured root fault device and forward them to all configured fault handlers.

Multiple fault handlers can be statically registered (using the FAULT_MGMT_HANDLER_DEFINE macro), each of which is called once per root fault device on initialization, then once per reported fault. Handlers are registered with a unique priority that determines the order in which they are called.
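The priority-ordered dispatch can be pictured with a plain C sketch. FAULT_MGMT_HANDLER_DEFINE's real signature is not reproduced here; the handler type and table below are simplified stand-ins for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct fault_device;

/* Simplified stand-in for a registered fault handler: called once per
 * reported fault, in ascending (unique) priority order. */
struct fault_handler {
    int priority;
    void (*handle)(const struct fault_device *dev, uint32_t fault_id);
};

static int by_priority(const void *a, const void *b)
{
    const struct fault_handler *ha = a, *hb = b;
    return ha->priority - hb->priority;
}

/* Sort the table by priority, then forward the fault to every handler
 * in order. (Zephyr typically collects such statically-registered
 * entries via link-time iterable sections rather than a runtime sort.) */
static void dispatch(struct fault_handler *tbl, size_t n,
                     const struct fault_device *dev, uint32_t fault_id)
{
    qsort(tbl, n, sizeof(tbl[0]), by_priority);
    for (size_t i = 0; i < n; i++)
        tbl[i].handle(dev, fault_id);
}
```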

Certain subsystem features are themselves implemented as handlers. To implement a Fault Management policy for a safety-critical system design, it is expected that one or more additional custom fault handlers would be required to perform tasks such as:

  • Configuring the criticality and enabled state of fault device safety mechanisms.

  • Performing a self-test routine before notifying the safety state device that the system is safe for operation.

  • Reacting to non-critical faults and deciding whether to perform a corrective action to reset the safety state or promote to a critical fault. This decision may be based on the provided fault count storage.
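A custom handler of the last kind might, for example, consult the stored historical count and decide whether to clear a non-critical fault or promote it. The policy below is purely hypothetical: the threshold, enum, and function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical policy: recover a non-critical fault until it has been
 * seen too many times, then promote it to critical. The threshold is
 * an assumed tuning value, not a subsystem default. */
#define NC_FAULT_PROMOTE_THRESHOLD 3u

enum fault_action { FAULT_ACTION_RECOVER, FAULT_ACTION_PROMOTE };

static enum fault_action nc_fault_policy(uint32_t historical_count)
{
    if (historical_count >= NC_FAULT_PROMOTE_THRESHOLD)
        return FAULT_ACTION_PROMOTE;   /* e.g. drive the safety state towards ERRC */
    return FAULT_ACTION_RECOVER;       /* e.g. clear the fault and return to SAFE */
}
```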

The subsystem has configuration options to manage the stack space, priority and queue size of both threads, which should be tuned and validated according to deployment requirements. Specifically, more complex custom handlers may require more stack space as they are called on the subsystem threads.

The public interface for the subsystem and its components is described under components/safety_island/zephyr/src/include/zephyr/subsys/fault_mgmt

Safety component

The safety component contains additional interfaces to facilitate reading and updating a system’s safety state.

If enabled, this component requires (and validates at boot) that all root fault devices have an attached safety state device.

Storage component

The storage component manages historical counts per safety mechanism per fault device.

Two storage backends are provided.

For the PSA backend, there are configuration options to manage the storage key and the maximum record count, which should be tuned and validated depending on the number of distinct faults and devices and/or other system constraints.
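The per-device, per-mechanism counting with a bounded record count can be pictured with a fixed-capacity table. This is a sketch of the behaviour only; the real backends and their key layout are not shown, and all names here are illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FAULT_RECORDS 8   /* stand-in for the configured record limit */

struct fault_count_record {
    uintptr_t dev;      /* stand-in for the fault device handle */
    uint32_t fault_id;  /* device-specific safety mechanism ID */
    uint32_t count;
};

static struct fault_count_record records[MAX_FAULT_RECORDS];
static size_t record_n;

/* Increment the count for (dev, fault_id), creating a record if needed.
 * Returns false when a new distinct fault would exceed the available
 * storage, mirroring the maximum-record-count constraint. */
static bool fault_count_increment(uintptr_t dev, uint32_t fault_id)
{
    for (size_t i = 0; i < record_n; i++) {
        if (records[i].dev == dev && records[i].fault_id == fault_id) {
            records[i].count++;
            return true;
        }
    }
    if (record_n == MAX_FAULT_RECORDS)
        return false;   /* table full: no room for a new record */
    records[record_n++] = (struct fault_count_record){ dev, fault_id, 1 };
    return true;
}
```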

Kronos Deployment

The Kronos FVP models:

  • An SSU in the Safety Island.

  • A System FMU in the Safety Island, attached to the SSU.

  • An FMU attached to the GIC-720AE in the Primary Compute, attached to the System FMU.

Kronos Fault Device Tree

The Kronos Fault Management application (components/safety_island/zephyr/src/apps/fault_mgmt) provides Kconfig and devicetree overlays for a sample deployment using these devices on Safety Island Cluster 1. The functionality can be evaluated using the Zephyr shell on this cluster. Additionally, this application serves as the basis for the automated validation (see Integration Tests Validating the Fault Management Subsystem).

For fault count storage, the application uses the PSA Protected Storage implementation provided by TF-M. CONFIG_MAX_PSA_PROTECTED_STORAGE_SIZE is configured according to TF-M storage constraints.

Validation

The Kronos Reference Design contains integration tests for the overall FMU and SSU integration, described at Integration Tests Validating the Fault Management Subsystem.

Shell Reference

The subsystem provides an optional shell command (enabled using CONFIG_FAULT_MGMT_SHELL) which exposes the subsystem API interactively for evaluation and validation purposes. Its sub-commands are described below.

  • fault tree - Print a description of the fault device tree (including any safety state devices) to the console. The device names printed here can be used in the other commands below.

  • fault inject DEVICE FAULT_ID - Inject a specific FAULT_ID into DEVICE. The resultant fault will be logged on the console.

  • fault set_enabled DEVICE FAULT_ID ENABLED - Enable or disable a specific FAULT_ID on a DEVICE. Set ENABLED to 1 to enable or 0 to disable.

  • fault set_critical DEVICE FAULT_ID CRITICAL - Configure a specific FAULT_ID on a DEVICE as critical or non-critical. Set CRITICAL to 1 to set as critical or 0 to set as non-critical.

The FAULT_ID above refers to a 32-bit integer whose valid values are device-specific (e.g. 0x100 represents an APB access error for a System FMU but a GICD Clock Error for a GIC-720AE FMU) and opaque to the driver itself.

The following are only available if CONFIG_FAULT_MGMT_SAFETY is enabled:

  • fault safety_status DEVICE - Print the current status of safety state DEVICE to the console.

  • fault safety_control DEVICE SIGNAL - Send SIGNAL to safety state DEVICE.

The following are only available if CONFIG_FAULT_MGMT_STORAGE is enabled:

  • fault list [THRESHOLD] - List all reported fault counts. The optional THRESHOLD filters out faults with counts below the given value.

  • fault summary - Show a more detailed summary of the fault counts, including a list of the most reported faults.

  • fault count - Print the total count of reported faults.

  • fault clear - Reset all fault counts back to zero.

The test suite at yocto/meta-kronos/lib/oeqa/runtime/cases/test_10_fault_mgmt.py demonstrates usage of these sub-commands.

Safety Considerations

The Fault Management subsystem has the following features to mitigate the risks of unexpected runtime behavior causing a denial of service:

  • Iterative methods that take a fixed amount of stack space based on CONFIG_FAULT_MGMT_MAX_TREE_DEPTH are used to traverse fault device trees.

  • Invalid combinations of configuration values (e.g. a root FMU without IRQ numbers) are detected at compile time where possible.

  • The subsystem functionality is composed of independent handlers which can be disabled if not required.
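The first mitigation can be sketched as an explicit-stack traversal whose worst-case stack usage is fixed by the configured maximum depth. The types and function below are illustrative, not the subsystem's API.

```c
#include <assert.h>
#include <stddef.h>

#define FAULT_MGMT_MAX_TREE_DEPTH 4   /* stand-in for the Kconfig value */

struct fault_dev {
    const struct fault_dev **children; /* upstream fault devices */
    size_t n_children;
};

/* Visit every device in the tree without recursion: the explicit stack
 * is sized by the compile-time depth limit, so stack usage is fixed
 * regardless of the tree's shape. Returns the number of devices
 * visited, or -1 if the tree is deeper than the configured limit. */
static int visit_tree(const struct fault_dev *root)
{
    struct frame { const struct fault_dev *dev; size_t next_child; };
    struct frame stack[FAULT_MGMT_MAX_TREE_DEPTH];
    size_t depth = 0;
    int visited = 0;

    stack[depth++] = (struct frame){ root, 0 };
    visited++;
    while (depth > 0) {
        struct frame *top = &stack[depth - 1];
        if (top->next_child == top->dev->n_children) {
            depth--;                 /* finished this subtree */
            continue;
        }
        const struct fault_dev *child = top->dev->children[top->next_child++];
        if (depth == FAULT_MGMT_MAX_TREE_DEPTH)
            return -1;               /* exceeds configured depth */
        stack[depth++] = (struct frame){ child, 0 };
        visited++;
    }
    return visited;
}
```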

Note that there are conditions where the subsystem will panic and the application running on the Safety Island cluster will stop processing further faults (non-exhaustive):

  • Faults arrive more quickly than they are handled over a long enough period for a queue to fill up.

  • A fault arrives at a System FMU from an unknown Device FMU.

  • The number of stored fault records exceeds the amount of available storage.

  • An unexpected error code is returned when attempting to write a fault count to the storage.