
More examples of how not to do things



In case anyone had not noticed, NASA is still providing us with nice examples of how not to design concurrent real-time systems. After the priority inversion on the last Mars lander, one might have hoped that they had changed their ways.

Adrian
--
Dr A E Lawrence
http://www-staff.lboro.ac.uk/~coael/
 On Sol 131, the rover encountered more trouble -- a software glitch
 -- and wound up taking an unplanned break that sol and the following
 sol. "It happened about 15 minutes after Spirit woke up to continue
 some driving," said Snyder. "The rover correctly responded to the
 unexpected reboot by halting further activities, recharging its
 batteries and then performing the next scheduled communications
 session. When the telemetry from this session was received, the
 flight software team quickly determined that the reboot occurred
 because an update to an area in memory was attempted, while access to
 that area was restricted. Flight software treats an attempt to update
 write-protected memory as an unrecoverable error and forces a reboot."

To update protected memory like this, the flight software follows a
3-step process: first it disables the write protection, then it
performs the update to the memory, and then it re-establishes write
protection, Snyder explained. "The specific
update that failed happens to occur every time the rover wakes up --
and for all the 130 previous sols on Spirit and 109 sols on
Opportunity -- this particular update occurred several times each sol
without incident. However, this time that 3-step update process was
interrupted, thus when the memory was attempted to be updated it
failed leading to the reboot." Although this flight software handles
such interruptions routinely, "in this case, the software that ran
during the interruption changed that write protection access and that
is the flaw in the software that caused the problem," Snyder
explained.
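
The failure described above can be sketched in a few lines. This is
only an illustrative model, not the flight software: the names
ProtectedRegion, update, flawed_handler and careful_handler are all
hypothetical, and the "interrupt" is simulated by an ordinary function
call placed between steps 1 and 2.

```python
class WriteFault(Exception):
    """A write hit protected memory (the rover would reboot here)."""


class ProtectedRegion:
    def __init__(self, size):
        self.data = [0] * size
        self.write_protected = True

    def write(self, index, value):
        if self.write_protected:
            raise WriteFault("write to protected memory")
        self.data[index] = value


def update(region, index, value, interrupt=None):
    """Three-step update; `interrupt` models code preempting after step 1."""
    region.write_protected = False      # step 1: disable write protection
    if interrupt is not None:
        interrupt(region)               # preemption inside the window
    try:
        region.write(index, value)      # step 2: perform the update
    finally:
        region.write_protected = True   # step 3: re-establish protection


def flawed_handler(region):
    # Does its own update and unconditionally re-protects on exit,
    # clobbering the state the interrupted code was relying on.
    region.write_protected = False
    region.write(0, region.data[0] + 1)
    region.write_protected = True


def careful_handler(region):
    # Saves the protection state on entry and restores it on exit.
    saved = region.write_protected
    region.write_protected = False
    region.write(0, region.data[0] + 1)
    region.write_protected = saved


def try_update(handler):
    region = ProtectedRegion(4)
    try:
        update(region, 1, 42, interrupt=handler)
        return "ok"
    except WriteFault:
        return "fault"
```

In this model, try_update(flawed_handler) faults while
try_update(careful_handler) succeeds. Saving and restoring the state is
only one possible repair; masking interrupts for the duration of the
three steps would be another.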

Since the team is confident this is an extremely low-probability
event, it has not adjusted the planning process to avoid the minuscule
period of vulnerability. (Opportunity has the same vulnerability to
the fault.) "Other than changing the software there is no guaranteed
method to prevent this thing from happening again," Snyder
said. "However, because the window of vulnerability is small, this
3-step process must be interrupted precisely between steps 1 and 2,
and it must be the flawed software that does the interruption. The
operations team has decided to accept the risk that the problem could
reoccur and not patch the flight software for the time being."

[Image caption: While most hills and mountains on Earth originate from
tectonic motions or volcanism, our planet also has some examples of
hills that originated from impacts of large meteorites, the
predominant origin for hills and mountains on the Moon. The grey hills
in this image from Devon Island in arctic Canada are material ejected
from an impact about 20 million years ago. The site is at 75 degrees
north latitude. Researchers' tents at the left give a sense of scale.
Image Credit: NASA/JPL/ASU]

The recovery from the anomaly occurred quickly and, Snyder said, "the
health and safety of the rover was maintained throughout the event."
The software error left rover planners with some uncertainty about
Spirit's final position and attitude, however, so Sol 132 was spent
re-establishing that knowledge with imaging of the rover's
surroundings with the Pan Cam, NavCam, and hazard avoidance
camera. The unplanned break did have a silver lining though -- it
resulted in fully charged batteries, paving the way for yet another
long drive.

-----------------------------------------------------------------------

But the following day, Sol 136 (last Friday), Spirit encountered
another, different computer error. "It occurred during another routine
event, stopping imaging operations about eight minutes before a
scheduled communications session," Snyder confirmed. "The rover again
correctly responded to the reboot by halting further activities, but
now since it was so late in the afternoon, the scheduled
communications could not be performed and the rover shut down for the
evening."

The team spent Sol 137 reestablishing contact with the rover and
acquiring telemetry. "After we did acquire telemetry from 2 UHF passes
-- one with Mars Global Surveyor (MGS) and one with Odyssey -- the
flight software team quickly determined that this reboot occurred
because a command to the camera interface was attempted while imaging
operations were being terminated and the interface did not
respond. The software treats this bus error as another unrecoverable
error and forces a reboot," said Snyder.

For the 135 previous sols on Spirit, and 114 previous sols on
Opportunity, imaging operations have been halted many times, often
multiple times per sol, with no incident. "However, this time we
caught a bus error," Snyder said. "After review of the camera
interface power software, it was discovered that it was indeed
possible to queue up a command to the camera interface, and then turn
off the interface power as part of stopping imaging. The queued
command was not cleared, so the result was a bus error. This too was a
flaw."

As with the first anomaly, the "window of vulnerability" to such an
error is very small, but -- unlike the first one -- there was an
operational workaround: team members could allow any queued commands
to clear out before turning off the interface power. Spirit
recovered from this anomaly relatively quickly, though two sols were
spent recovering communication, reestablishing position and attitude
knowledge, and acquiring additional telemetry, which means she lost
about three sols. Normal operations resumed on Sol 139 and Spirit
continued on her journey.
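
The second flaw and its workaround can be modelled the same way. Again
this is a hedged sketch under assumed names (CameraInterface,
stop_imaging_flawed, stop_imaging_drained are hypothetical), with a
simple in-memory queue standing in for the real camera interface.

```python
from collections import deque


class BusError(Exception):
    """A command was sent to an interface that did not respond."""


class CameraInterface:
    def __init__(self):
        self.powered = True
        self.pending = deque()   # commands queued but not yet executed
        self.executed = []

    def queue(self, command):
        self.pending.append(command)

    def run_next(self):
        command = self.pending.popleft()
        if not self.powered:
            raise BusError(f"no response to {command!r}")
        self.executed.append(command)

    def power_off(self):
        self.powered = False


def stop_imaging_flawed(camera):
    # Turns the power off and leaves any queued command behind.
    camera.power_off()


def stop_imaging_drained(camera):
    # Workaround: let queued commands clear out before powering off.
    while camera.pending:
        camera.run_next()
    camera.power_off()


def scenario(stop):
    camera = CameraInterface()
    camera.queue("take_image")
    stop(camera)
    try:
        while camera.pending:    # a leftover command now reaches the bus
            camera.run_next()
        return "ok"
    except BusError:
        return "bus error"
```

Here scenario(stop_imaging_flawed) ends in a bus error and
scenario(stop_imaging_drained) does not. Discarding the queue at
power-off would also avoid the error in this model, at the cost of
losing the queued command.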

Are these two errors a sign of the dark days that will come for both
these rovers? "No," Snyder said. "We don't believe it's suggestive
of deterioration. We think it's just unfortunate that we had two low
probability events happen in the same week. These two problems could
have happened in the beginning of the mission or middle or at any
time." Even though these specific errors were not due to age or
duration of mission, "the longer we operate the rovers and the more we
do with them, it is conceivable we could trip across problems of this
sort or other problems," Snyder added.
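
The point about longer operation can be made with a little arithmetic.
If each vulnerable update independently has a small probability p of
being hit, the chance of at least one hit in n updates is
1 - (1 - p)^n, which grows steadily with n. The figures below are
purely hypothetical, since the article gives no value for p and says
only that the update occurs "several times each sol".

```python
def chance_of_at_least_one(p_per_update, updates):
    """Probability that an event with per-update probability p occurs
    at least once in `updates` independent updates: 1 - (1 - p)**n."""
    return 1.0 - (1.0 - p_per_update) ** updates


# Hypothetical figures, chosen only to show the shape of the curve.
P_PER_UPDATE = 1e-4      # assumed chance that a single update is hit
UPDATES_PER_SOL = 3      # assumed stand-in for "several times each sol"

if __name__ == "__main__":
    for sols in (100, 1000, 10000):
        n = sols * UPDATES_PER_SOL
        print(sols, "sols:", round(chance_of_at_least_one(P_PER_UPDATE, n), 3))
```

The independence assumption is itself a simplification, since real
interruptions may well be correlated with what the rover is doing.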