What measures are being taken to avoid blue screens on satellites costing hundreds of billions of yen?



Once a satellite is launched, it can only be reached through its communication link, so if a fatal system error occurs, such as a blue screen in Windows, there is a risk that the communication function will stop and recovery will become impossible. Satellite systems engineer Clark Wakeland explained how such failures are handled.

How to avoid a BSOD on your 2 billion dollar spacecraft | Clark Wakeland
https://clarkwakeland.com/blog/2024/avoiding-a-BSOD-on-your-satellite/



The OS of the satellites Wakeland works on is custom flight software (FSW) written in C, but as with other operating systems, the response to a fatal error is to reboot the system. However, no one can travel to a satellite in orbit and power-cycle it by hand.

That's where the watchdog timer comes in. The flight software resets the watchdog timer at regular intervals as part of its normal operation. If the software stops, the resets stop too, so the timer expires and triggers the system's recovery operations.

In the case of Wakeland's satellites, if the watchdog timer goes unreset for about 30 seconds, the system goes into safe mode, shutting down most functions and allowing the satellite to focus on turning its solar panels to the sun to generate power and trying to re-establish any lost communications.
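The watchdog behavior described above can be sketched as a small simulation in C. This is only an illustrative sketch, not Wakeland's flight software: the names `watchdog_t`, `watchdog_reset`, `watchdog_tick`, and `simulate_hang` are hypothetical, time is modeled as a simple second counter, and the 30-second timeout is taken from the approximate figure in the article. A real watchdog is typically a hardware counter that runs independently of the software it supervises.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed timeout, based on the "about 30 seconds" mentioned in the article. */
#define WATCHDOG_TIMEOUT_S 30u

typedef enum { MODE_NOMINAL, MODE_SAFE } fsw_mode_t;

typedef struct {
    uint32_t timeout_s;    /* seconds allowed between resets */
    uint32_t last_reset_s; /* time of the most recent reset */
    fsw_mode_t mode;       /* current system mode */
} watchdog_t;

void watchdog_init(watchdog_t *wd, uint32_t timeout_s, uint32_t now_s) {
    assert(wd != NULL && timeout_s > 0);
    wd->timeout_s = timeout_s;
    wd->last_reset_s = now_s;
    wd->mode = MODE_NOMINAL;
}

/* Called by healthy software as part of its normal processing loop. */
void watchdog_reset(watchdog_t *wd, uint32_t now_s) {
    wd->last_reset_s = now_s;
}

/* Called once per clock tick, independently of the software's health.
 * Switches to safe mode if the timer has gone unreset for too long.
 * Returns true while the system is in safe mode. */
bool watchdog_tick(watchdog_t *wd, uint32_t now_s) {
    if (wd->mode == MODE_NOMINAL &&
        now_s - wd->last_reset_s >= wd->timeout_s) {
        wd->mode = MODE_SAFE;
    }
    return wd->mode == MODE_SAFE;
}

/* Simulates software that resets the watchdog every second and then
 * hangs at hang_at_s. Returns the second at which safe mode is entered,
 * or -1 if safe mode is never entered within duration_s. */
int simulate_hang(uint32_t hang_at_s, uint32_t duration_s) {
    watchdog_t wd;
    watchdog_init(&wd, WATCHDOG_TIMEOUT_S, 0);
    for (uint32_t t = 1; t <= duration_s; t++) {
        if (t < hang_at_s) {
            watchdog_reset(&wd, t); /* software still alive */
        }
        if (watchdog_tick(&wd, t)) {
            return (int)t;
        }
    }
    return -1;
}
```

In this sketch, software that hangs at second 10 last resets the timer at second 9, so `simulate_hang(10, 100)` reports safe mode at second 39. The sketch stops at the mode switch; what safe mode actually does (shutting down functions, pointing the solar panels at the sun, re-establishing communications) is not modeled here.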

Wakeland also described a time when safe mode was triggered during a closed-loop test, a test of all systems performed at the end of a satellite's development. As the name suggests, a closed-loop test checks the satellite's control through a closed loop: simulated orbital data is sent to the system, and new orbital data is then simulated from the system's attitude response.
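The loop structure of such a test can be sketched in a few lines of C. This is a deliberately simplified, hypothetical model and not the actual test setup: it reduces the satellite's state to a single pointing angle, and the functions `fsw_control`, `sim_propagate`, and `run_closed_loop`, along with the gain of 0.5, are assumptions made for illustration.

```c
#include <assert.h>

/* Stand-in for the system under test: given the simulated measurement,
 * return a correction command (simple proportional control, assumed). */
double fsw_control(double measured_deg, double target_deg) {
    const double gain = 0.5; /* assumed gain per step */
    return gain * (target_deg - measured_deg);
}

/* Stand-in for the simulator: apply the system's response to produce
 * the next simulated state. */
double sim_propagate(double angle_deg, double cmd_deg) {
    return angle_deg + cmd_deg;
}

/* Runs the closed loop for the given number of steps and returns the
 * remaining pointing error (target minus final angle). */
double run_closed_loop(double start_deg, double target_deg, int steps) {
    assert(steps >= 0);
    double angle = start_deg;
    for (int i = 0; i < steps; i++) {
        /* Simulated data goes into the system, a response comes out. */
        double cmd = fsw_control(angle, target_deg);
        /* The response is fed back to simulate the next state. */
        angle = sim_propagate(angle, cmd);
    }
    return target_deg - angle;
}
```

With these assumed values the pointing error halves on every pass through the loop, so a test harness could check that the error shrinks toward zero over time. If the system under test crashes partway through, as in the incident Wakeland describes, the response never comes back and the loop breaks.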



Because the decision on which memory address to use for sending response data back from the satellite in the closed-loop test was delayed, the memory address from a previously developed satellite was used as a temporary placeholder during development. The fact that it was only a placeholder was forgotten and then overlooked in the review meeting. When the closed-loop test was run, the address did not exist, which caused a null pointer exception and crashed the system.

Even in safe mode, dozens of commands had to be entered correctly to restart the system properly, and it took about 12 hours of troubleshooting to resolve the issue. Wakeland also said that because the system was so expensive, any entry into safe mode, even during testing, had to be reported to the client, the US government.

in Software, Posted by log1d_ts