What is the 'fail-silent architecture' that NASA used to ensure the computers on the Artemis II spacecraft never malfunction?

How NASA Built Artemis II's Fault-Tolerant Computer – Communications of the ACM
https://cacm.acm.org/news/how-nasa-built-artemis-iis-fault-tolerant-computer/

The Apollo Guidance Computer, used by astronauts in the 1960s Apollo program, was innovative for its time, with a processor running at about 1MHz and approximately 4KB of rewritable memory. However, its role was limited, and many crucial functions, such as environmental control and power management, relied on manual or electromechanical mechanisms like switches and relays.
Orion, the spacecraft that will carry a crew of four around the Moon on Artemis II, the first such mission in over 50 years, is different: its computer system comprehensively manages almost all safety-critical functions, from life support to communication control. In space, on-site repairs and emergency landings are virtually impossible, and a computer malfunction could immediately mean mission failure. The computer on Orion is therefore designed to withstand radiation-induced bit flips and hardware failures and to keep operating without interruption, which has earned it the description 'the most fault-tolerant computer system ever developed for spaceflight.'

Outer space is a high-radiation environment in which high-energy particles can disturb electronic equipment and corrupt calculations. To keep incorrect answers from reaching the spacecraft's thrusters, NASA has gone beyond conventional 'triple redundancy.' Orion uses two Vehicle Management Computers (VMCs), each containing two Flight Control Modules (FCMs), for a total of four FCMs. Each FCM in turn consists of two processors that monitor each other, for a total of eight CPUs running the same flight software in parallel.
Nate Uitenbroek, head of software integration and verification for NASA's Orion program, explains, 'We continue to design to cope with hardware failures. In addition to physically redundant wiring, we have multiple logical network systems in place, and the flight computer itself is also redundant. All of this is to cope with hardware failures.' According to Uitenbroek, if a computer makes a calculation error due to radiation or some other cause, it goes silent rather than sending a wrong answer, which is why the design is called 'fail-silent.'
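The fail-silent idea can be illustrated with a minimal Python sketch. It assumes each channel is a pair of lanes that compute the same value and that a channel emits nothing when its lanes disagree. The names `self_checking_pair`, `first_valid`, `healthy`, and `upset` are hypothetical, and picking the first non-silent channel is an assumption made for illustration, not a description of Orion's actual selection logic.

```python
from typing import Callable, Iterable, Optional

Lane = Callable[[int], int]


def self_checking_pair(lane_a: Lane, lane_b: Lane, sensor_value: int) -> Optional[int]:
    """Run both lanes on the same input; return None (stay silent) on any mismatch."""
    out_a = lane_a(sensor_value)
    out_b = lane_b(sensor_value)
    return out_a if out_a == out_b else None


def first_valid(outputs: Iterable[Optional[int]]) -> Optional[int]:
    """Pick the first channel that did not go silent (illustrative policy only)."""
    for out in outputs:
        if out is not None:
            return out
    return None


def healthy(x: int) -> int:
    return x * 2


def upset(x: int) -> int:
    # Same computation, but with bit 4 of the result flipped to mimic a radiation upset.
    return (x * 2) ^ (1 << 4)


# Four channels, as in the article; the first one has a faulty lane and goes silent.
channels = [
    self_checking_pair(healthy, upset, 21),
    self_checking_pair(healthy, healthy, 21),
    self_checking_pair(healthy, healthy, 21),
    self_checking_pair(healthy, healthy, 21),
]
command = first_valid(channels)  # 42, taken from the second channel
```

The point of the sketch is that a faulty channel never has to be outvoted: it simply contributes no output, so any remaining healthy channel can supply the command.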

Keeping multiple computers perfectly synchronized, as a fail-silent system requires, is an extremely difficult problem in computer science, because even a slight timing difference can cause correctly functioning computers to produce different results. NASA addresses this problem by employing a 'strictly deterministic architecture.'
The hardware has been hardened as well. The memory is 'triple modular redundant memory' that automatically corrects single-bit errors on every read operation. The network is also composed of multiple paths, and if a bit flip is detected during communication, the affected path is immediately disabled.
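As a rough illustration of how triple modular redundancy can correct a single-bit error on read, the following Python sketch stores three copies of a word and takes a bitwise majority vote. The class `TmrWord` and the function `majority_vote` are hypothetical names, and the sketch assumes a simple per-bit vote with write-back; the source does not describe how Orion's memory hardware implements this.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit is the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)


class TmrWord:
    """One memory word stored as three copies (software model of a hardware idea)."""

    def __init__(self, value: int) -> None:
        self.copies = [value, value, value]

    def read(self) -> int:
        voted = majority_vote(*self.copies)
        # Write the voted value back so a corrected error does not linger.
        self.copies = [voted, voted, voted]
        return voted


word = TmrWord(0b1011_0010)
word.copies[1] ^= 1 << 5   # simulate a single-bit upset in one copy
value = word.read()        # the vote restores 0b1011_0010
```

Because the vote is taken per bit, flips in different bit positions of different copies are also corrected; only the same bit flipping in two copies defeats it.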
Even with a fail-silent architecture that catches errors through automated checks, there remains the possibility of a software bug or a catastrophic event known as a 'common mode failure,' which could in theory affect all primary channels simultaneously. To mitigate this risk, Orion carries completely independent backup flight software (BFS) that automatically takes over control if the main system fails, so that the vehicle can be brought to a safe state. Even in a 'dead bus' condition, in which power is completely lost, Orion is designed to automatically enter safe mode, stabilize its attitude, point its solar panels at the sun to restart power generation, and attempt to re-establish communications, thus ensuring survival.
According to Communications of the ACM, the Apollo era relied heavily on mechanical backups, so computers were not the sole determining factor for survival, whereas modern spacecraft rely almost entirely on software for functions such as thermal control and power management. Keeping that software synchronized and working correctly in a radiation-exposed environment is therefore a central challenge. NASA uses large-scale verification workflows, including Monte Carlo methods, to establish this reliability. The 'zero-tolerance architecture' used in Artemis II, which is meant to keep the spacecraft operating even when the software produces an erroneous output, is expected to have potential applications in future ground systems such as autonomous vehicles and power infrastructure.
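To give a feel for what a Monte Carlo check of a fail-silent design might look like, here is a toy Python sketch that injects random bit flips into a two-lane channel and counts how often the channel is correct, silent, or wrong. Everything in it is an assumption made for illustration: the `compute` function, the 16-bit word, the flip probability, and the fault model are hypothetical, and it does not represent NASA's actual verification workflow.

```python
import random

WORD_BITS = 16


def compute(x: int) -> int:
    """Stand-in for a flight computation (hypothetical)."""
    return (x * 3 + 7) & 0xFFFF


def run_trial(rng: random.Random, flip_probability: float) -> str:
    x = rng.randrange(1000)
    expected = compute(x)
    out_a = compute(x)
    out_b = compute(x)
    # Independently flip one random bit in each lane's result with the given probability.
    if rng.random() < flip_probability:
        out_a ^= 1 << rng.randrange(WORD_BITS)
    if rng.random() < flip_probability:
        out_b ^= 1 << rng.randrange(WORD_BITS)
    if out_a != out_b:
        return "silent"          # the mismatch is caught, nothing is sent
    return "correct" if out_a == expected else "wrong"


def monte_carlo(trials: int, flip_probability: float, seed: int = 0) -> dict:
    rng = random.Random(seed)    # fixed seed keeps runs reproducible
    counts = {"correct": 0, "silent": 0, "wrong": 0}
    for _ in range(trials):
        counts[run_trial(rng, flip_probability)] += 1
    return counts


results = monte_carlo(trials=100_000, flip_probability=0.01)
```

In this toy model a wrong answer escapes only when both lanes suffer the identical flip in the same trial, which is why the "wrong" count should be far smaller than the "silent" count.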
in Software, Posted by log1e_dh







