Space-grade CPUs: How do you send more computing power into space?
Phobos-Grunt, perhaps the most ambitious deep space mission ever attempted by Russia, crashed down into the ocean at the beginning of 2012. The spacecraft was supposed to land on the battered Martian moon Phobos, gather soil samples, and get them back to Earth. Instead, it ended up helplessly drifting in Low Earth Orbit (LEO) for a few weeks because its onboard computer crashed just before it could fire the engines to send the spacecraft on its way to Mars.
In the ensuing report, Russian authorities blamed heavy charged particles in galactic cosmic rays that hit the SRAM chips and led to a latch-up, a chip failure resulting from excessive current passing through. To deal with this latch-up, two processors working in the Phobos-Grunt’s TsVM22 computer initiated a reboot. After rebooting, the probe then went into a safe mode and awaited instructions from ground control. Unfortunately, those instructions never arrived.
Antennas meant for communications were supposed to become fully operational in the cruise stage of Phobos-Grunt, after the spacecraft left the LEO. But nobody planned for a failure preventing the probe from reaching that stage. After the particle strike, the Phobos-Grunt ended up in a peculiar stalemate. Firing on-board engines was supposed to trigger the deployment of antennas. At the same time, engines could only be fired with a command issued from ground control. This command, however, could not get through, because antennas were not deployed. In this way, a computer error killed a mission that was several decades in the making. It happened, in part, because of some oversights from the team at the NPO Lavochkin, a primary developer of the Phobos-Grunt probe. During development, in short, it was easier to count the things that worked in their computer than to count the things that didn’t. Every little mistake they made, though, became a grave reminder that designing space-grade computers is bloody hard. One misstep and billions of dollars go down in flames.
Everyone involved had simply grossly underestimated the challenge of carrying out computer operations in space.
Why so slow?
Curiosity, everyone’s favorite Mars rover, works with two BAE RAD750 processors clocked at up to 200MHz. It has 256MB of RAM and 2GB of SSD. As we near 2020, the RAD750 stands as the current state-of-the-art, single-core space-grade processor. It’s the best we can send on deep space missions today.
Compared to any smartphone we carry in our pockets, unfortunately, the RAD750’s performance is simply pathetic. The design is based on the PowerPC 750, a processor that IBM and Motorola introduced in late 1997 to compete with Intel's Pentium II. This means that perhaps the most technologically advanced space hardware up there is totally capable of running the original Starcraft (the one released in 1998, mind you) without hiccups, but anything more computationally demanding would prove problematic. You can forget about playing Crysis on Mars.
Still, the price tag on the RAD750 is around $200k. Why not just throw an iPhone in there and call it a day? Performance-wise, iPhones are entire generations ahead of RAD750s and cost just $1k apiece, which remains much less than $200k. In retrospect, this is roughly what the Phobos-Grunt team tried to accomplish. They tried to boost performance and cut costs, but they ended up cutting corners.
The SRAM chip in the Phobos-Grunt that was hit by a heavy charged particle went under the name of WS512K32V20G24M. It was well known in the space industry because back in 2005, T.E. Page and J.M. Benedetto had tested those chips in a particle accelerator at the Brookhaven National Laboratory to see how they performed when exposed to radiation. The researchers described the chips as "extremely" vulnerable, and single-event latch-ups occurred even at the minimum heavy-ion linear energy transfer available at Brookhaven. This was not a surprising result, mind you, because WS512K32V20G24M chips were never meant or tested for space. They were designed for aircraft, military-grade aircraft for that matter. But still, they were easier to obtain and cheaper than real space-grade memories, so the Russians involved with Phobos-Grunt went for them regardless.
"The discovery of the various kinds of radiation present in the space environment was among the most important turning points in the history of space electronics, along with the understanding of how this radiation affects electronics, and the development of hardening and mitigation techniques,” says Dr. Tyler Lovelly, a researcher at the US Air Force Research Laboratory. Main sources of this radiation are cosmic rays, solar particle events, and belts of protons and electrons circling at the edge of the Earth’s magnetic field known as Van Allen belts. Particles hitting the Earth’s atmosphere are composed of roughly 89% protons, 9% alpha particles, 1% heavier nuclei, and 1% solitary electrons. They can reach energies up to 10^19 eV. Using the chips not qualified for space in a probe that intended to travel through deep space for several years was asking for a disaster to happen. In fact, Krasnaya Zvezda, a Russian military newspaper, reported at that time that 62% of the microchips used on the Phobos-Grunt were not qualified for spaceflight. The probe design was 62% driven by a "let’s throw in an iPhone" mindset.
Radiation becomes a thing
Today, radiation is one of the key factors designers take into account when building space-grade computers. But it has not always been that way. The first computer reached space onboard a Gemini spacecraft back in the 1960s. The machine had to undergo more than a hundred different tests to get flight clearance. Engineers checked how it performed when exposed to vibrations, vacuum, extreme temperatures, and so on. But none of those tests covered radiation exposure. Still, the Gemini onboard computer managed to work just fine, with no issues whatsoever. That was the case because the Gemini onboard computer was too big to fail. Literally. Its whopping 19.5KB of memory was housed in a 700-cubic-inch box weighing 26 pounds. The whole computer weighed 58.98 pounds.
Generally for computing, pushing processor technology forward has always been done primarily by reducing feature sizes and increasing clock rates. We just made transistors smaller and smaller, moving from 240nm to 65nm to 14nm, down to the 7nm designs we have in modern smartphones. The smaller the transistor, the lower the voltage necessary to turn it on and off. That’s why older processors with larger feature sizes were mostly unaffected by radiation or, to be specific, by so-called single event upsets (SEUs). The voltage created by particle strikes was too low to really affect the operation of transistors that large. But when engineers shrank feature sizes to pack more transistors onto a chip, those particle-generated voltages became more than enough to cause trouble.
Another thing engineers and developers typically do to improve CPUs is to clock them higher. The Intel 386SX that ran the so-called "glass cockpit" in the Space Shuttles was clocked at roughly 20MHz. Modern processors can go as high as 5GHz in short bursts. A clock rate determines how many processing cycles a processor can go through in a given time. The catch with radiation is that a particle strike can corrupt data stored in on-CPU memory (like L1 or L2 cache) only during an extremely brief moment in time called a latching window. This means that in every second, there is a limited number of opportunities for a charged particle to do damage. In low-clocked processors like the 386SX, this number was relatively low. But when clock speeds got higher, the number of latching windows per second increased as well, making processors more vulnerable to radiation. This is why radiation-hardened processors are almost always clocked way lower than their commercial counterparts. The main reason why space CPUs develop at such a sluggish pace is that pretty much every conceivable way to make them faster also makes them more fragile.
Fortunately, there are ways around this issue.
"In the old days, radiation effects were often mitigated by modifications implemented in the semiconductor process,” says Roland Weigand, a VISI/ASIC engineer at the European Space Agency. "It was sufficient to take a commercially available information processing core and implement it on a radiation hardened process.” Known as radiation hardening by process, this technique relied on using materials like sapphire or gallium arsenide that were less susceptible to radiation than silicon in the fabrication of microprocessors. Thus, manufactured processors worked very well in radiation-heavy environments like space, but they required an entire foundry to be retooled just to make them.
"To increase performance we had to use more and more advanced processors. Considering the cost of a modern semiconductor factory, custom modifications in the manufacturing process ceased to be feasible for such a niche market as space,” Weigand says. According to him, this trend eventually forced engineers to use commercial processors prone to single-event effects. "And to mitigate this, we had to move to alternative radiation-hardening techniques, especially the one we call radiation hardening by design,” Weigand adds.
The RHBD (radiation hardening by design) process enabled manufacturers to use a standard CMOS (Complementary metal–oxide–semiconductor) fabrication process. This way, space-grade processors could be manufactured in commercial foundries, bringing the prices down to a manageable level and enabling space mission designers to catch up a little to commercially available stuff. Radiation was dealt with by engineering ingenuity rather than the sheer physics of the material. "For example, Triple Modular Redundancy is one of the most popular ways to achieve increased radiation resistance of an otherwise standard chip,” Weigand explained. "Three identical copies of every single bit of information are stored in the memory at all times. In the reading stage, all three copies are read and the correct one is chosen by a majority voting.”
With this approach, if all three copies are identical, the bit under examination is declared correct. The same is true when just two copies are identical but the third is different; the majority vote decides which bit value is the correct one. When no two copies of a stored value agree (possible only for values wider than a single bit), the system registers this as an error. The whole idea behind TMR is that the copies are stored at different addresses in the memory that are placed at different spots on a chip. To corrupt data, two particles would have to simultaneously strike exactly where two copies of the same bit are stored, and that is extremely unlikely. The downside to TMR, though, is that this approach leads to a lot of overhead. A processor has to go through every operation three times, which means it can only reach one-third of its performance.
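The majority vote described above can be sketched in a few lines of Python. This is only an illustration of the idea: the function names and the 8-bit example value are hypothetical, and real TMR is implemented in hardware rather than software.

```python
def store_tmr(value):
    """Keep three independent copies of a value (hypothetical helper)."""
    return [value, value, value]


def vote(copies):
    """Bitwise majority vote: each output bit takes the value held by at least two copies."""
    a, b, c = copies
    return (a & b) | (a & c) | (b & c)


def flip_bit(value, bit):
    """Simulate a single event upset by inverting one bit."""
    return value ^ (1 << bit)


copies = store_tmr(0b10110010)
copies[1] = flip_bit(copies[1], 4)    # one particle corrupts one copy
recovered = vote(copies)
needs_repair = len(set(copies)) > 1   # any disagreement means a copy should be rewritten

print(f"{recovered:08b}", needs_repair)   # prints "10110010 True"
```

The vote only fails when the same bit is flipped in two of the three copies, which is why the copies are kept physically apart on the chip.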
Thus, the latest idea in the field is to get space-grade processors even closer to their commercially available counterparts. Instead of designing an entire system on chip with radiation-hard components, engineers choose where radiation hardness is really necessary and where it can safely be dispensed with. That’s a significant shift in the design priorities. Space-grade processors of old were built to be immune to radiation. Modern processors are not immune anymore, but they are designed to automatically deal with all kinds of errors radiation may cause.
The LEON GR740, for example, is the latest European space-grade processor. It’s estimated to experience a staggering 9 SEUs a day in geostationary Earth orbit. The trick is that all those SEUs are mitigated by the system and do not lead to functional errors. The GR740 is built to experience one functional error every 300 or so years. And even if that happens, it can recover just by rebooting.
Europe goes open source
The LEON line of space-grade processors, built on the SPARC architecture, is by far the most popular choice for space in Europe today. "Back in the 1990s, when the SPARC specification was chosen, it had significant industry penetration," says Weigand. "Sun Microsystems was using SPARC on their successful workstations." According to him, the key reasons behind going to SPARC were existing software support and openness. "An open source architecture meant anybody could use it without licensing issues. That was particularly important since in such a niche market as space, the license fee is distributed among a very limited number of devices, which can increase their prices dramatically," he explains.
Ultimately, ESA learned about the issues with licensing the hard way. The first European space-grade SPARC processor—the ERC32, which is still in use today—was using commercial information processing cores. It was based on an open source architecture, but the processor design was proprietary. "This led to problems. With proprietary designs you usually don’t have access to the source code, and thus making the custom modifications necessary to achieve radiation hardening is difficult,” says Weigand. That’s why in the next step, ESA started working on its own processor, named LEON. "The design was fully under our control, so we were finally free to introduce all RHBD techniques we wanted."
The latest development in the line of LEON processors is the quad-core GR740 clocked at roughly 250MHz. ("We’re expecting to ship first flight parts towards the end of 2019," Weigand says.) The GR740 is fabricated in the 65nm process technology. The device is a system-on-chip designed for high-performance, general-purpose computing based on the SPARC32 instruction set architecture. "The goal in building the GR740 was to achieve higher performance and capability to have additional devices included in one integrated circuit while keeping the whole system compatible with previous generations of European space-grade processors," says Weigand. Another feature of the GR740 is advanced fault tolerance. The processor can experience a significant number of errors caused by radiation and ensure uninterrupted software execution nonetheless. Each block and function of the GR740 has been optimized for the best possible performance. This meant that components sensitive to single event upsets were used alongside ones that could withstand them easily. All SEU-sensitive parts have been implemented with a scheme designed to mitigate possible errors through redundancy.
For example, some flip-flops (basic processor components that can store either 1s or 0s) in the GR740 are off-the-shelf commercial parts known as CORELIB FFs. The choice to use them was made because they took less space on the chip and thus increased its computational density. The downside was that they were susceptible to SEUs, but this vulnerability has been dealt with by the Block TMR correction scheme where every bit read from those flip-flops is voted on by modules arranged with adequate spacing among them to prevent multiple bit upsets (scenarios where one particle can flip multiple bits at once). There are similar mitigation schemes implemented for L1 and L2 cache memories composed of SRAM cells, which are also generally SEU-sensitive. When the penalty such schemes inflicted on performance was eventually considered too high, ESA engineers went for SEU-hardened SKYROB flip-flops. Those, however, took twice the area of CORELIBs. Ultimately when thinking about space and computing power, there was always some kind of trade-off to make.
So far, the GR740 has passed several radiation tests with flying colors. The chip has been bombarded with heavy ions with linear energy transfer (LET) reaching 125 MeV·cm²/mg and worked through all of this without hiccups. To put that in perspective, the SRAM chips that most likely brought down the Phobos-Grunt latched up when hit with heavy ions of just 0.375 MeV·cm²/mg. The GR740 withstood an LET over 300 times higher than the one that tripped the chips the Russians had put in their probe. Besides a near-immunity to single-event effects, the GR740 is specced to take up to 300 krad(Si) of radiation in its lifetime. In the testing phase, Weigand’s team even had one of the processors irradiated to 292 krad(Si). Despite that, the chip worked as usual, with no signs of degradation whatsoever.
Still, specific tests to check the actual total ionizing dose the GR740 can take are yet to come. All those numbers combined mean that the processor working in geostationary Earth orbit should experience one functional error every 350 years. In LEO, this time should be around 1,310 years. And even those errors wouldn’t kill the GR740. It would just need to do a reset.
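To get a feel for how much work the mitigation does, the figure of 9 SEUs a day quoted earlier for the GR740 can be set against the estimate of one functional error every 350 years in geostationary orbit. The sketch below is back-of-the-envelope arithmetic only; it assumes both figures describe the same chip in the same orbit, that the upset rate is constant, and that a year has 365.25 days.

```python
SEUS_PER_DAY_GEO = 9               # upsets per day quoted for geostationary orbit
YEARS_PER_FUNCTIONAL_ERROR = 350   # geostationary estimate quoted above
DAYS_PER_YEAR = 365.25             # assumption: average year length

# Upsets the chip absorbs, on average, for each one that becomes a functional error.
upsets_per_functional_error = (
    SEUS_PER_DAY_GEO * DAYS_PER_YEAR * YEARS_PER_FUNCTIONAL_ERROR
)
unmitigated_fraction = 1 / upsets_per_functional_error

print(f"{upsets_per_functional_error:,.0f} upsets per functional error")
print(f"{unmitigated_fraction:.1e} of upsets slip through")
```

Under those assumptions, roughly 1.15 million upsets are absorbed for every one that turns into a functional error.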
"Space-grade CPUs developed in the US have traditionally been based on proprietary processor architectures such as PowerPC because people had more extensive experience working with them and they were widely supported in software,” says the Air Force Research Labs’ Lovelly. After all, the history of space computation began with digital processors delivered by IBM for the Gemini mission back in the 1960s. And the technology IBM worked with was proprietary.
To this day, BAE RAD processors are based on the PowerPC, which was brought to life by a consortium of IBM, Apple, and Motorola. Processors powering glass cockpits in the Space Shuttles and the Hubble Space Telescope were made in the x86 architecture introduced by Intel. Both PowerPC and x86 were proprietary. So in keeping with tradition, the latest American design in this field, named High Performance Spaceflight Computing (HPSC), is proprietary, too. The only difference is that PowerPC and x86 were best known from desktop computers, while the HPSC is based on the ARM architecture that today works in most smartphones and tablets.
The HPSC has been designed by NASA, the Air Force Research Laboratory, and Boeing, which is responsible for manufacturing the chips. The HPSC is based on the ARM Cortex A53 quad-core processors. It will have two such processors connected by an AMBA bus, which makes it an octa-core system. This should place its performance somewhere in the range of mid-market 2018 smartphones like the Samsung Galaxy J8 or development boards like the HiKey Lemaker or Raspberry Pi. (That’s before radiation hardening, which will cut its performance by more than half.) Nevertheless, we’re no longer likely to read bleak headlines screaming that 200 processors powering the Curiosity rover would not be enough to beat one iPhone. With the HPSC up and running, it is more likely to take three or four chips to get iPhone-like computing power.
"Since we do not yet have an actual HPSC for tests, we can make some educated guesses as to what its performance may be like,” says Lovelly. Clock speed was the first aspect to go under scrutiny. Commercial Cortex A53 octa-core processors are usually clocked between 1.2GHz (in the HiKey Lemaker for example) and 1.8GHz (in the Snapdragon 450). To estimate what the clock speed would look like in the HPSC after radiation hardening, Lovelly compared various space-grade processors with their commercially available counterparts. "We just thought it reasonable to expect a similar hit on performance,” he says. Lovelly estimated HPSC clock speed at 500MHz. This would still be exceptionally fast for a space-grade chip. In fact, if this turned out to be true for the flight version, the HPSC would have the highest clock rate among space-grade processors. But more computing power and higher clock rates usually come at a dear price in space.
BAE RAD5545 is probably the most powerful radiation-hardened processor available today. Fabricated in the 45nm process, it is a 64-bit quad-core machine clocked at 466MHz with power dissipation of up to 20 Watts—and 20 Watts is a lot. A Quad Core i5 sitting in a 13-inch MacBook Pro 2018 is a 28 Watt processor. It can heat its thin aluminum chassis to really high temperatures up to a point where it becomes an issue for some users. Under more computationally intensive workloads, fans immediately kick in to cool the whole thing down. The only issue is that, in space, fans would do absolutely nothing, because there is no air they could blow onto a hot chip. The only possible way to get heat out of a spacecraft is through radiation, and that takes time. Sure, heat pipes are there to take excessive heat away from the processor, but this heat has to eventually go somewhere. Moreover, some missions have tight energy budgets, and they simply can’t use powerful processors like RAD5545 under such restrictions. That’s why the European GR740 has power dissipation at only 1.5 Watts. It’s not the fastest of the lot, but it is the most efficient. It simply gives you the most computational bang per Watt. The HPSC with 10 Watt power dissipation comes in at a close second, but not always.
"Each core on the HPSC has its own Single Instruction Multiple Data unit,” says Lovelly. "This gives it a significant performance advantage over other space-grade processors.” SIMD is a technology commonly used in commercial desktop and mobile processors since the 1990s. It helps processors handle image and sound processing in video games better. Let’s say we want to brighten up an image. There are a number of pixels, and each one has a brightness value that needs to be increased by two. Without SIMD, a processor would need to go through all those additions in sequence, one pixel after the other. With SIMD, though, the task can be parallelized. The processor simply takes multiple data points—brightness values of all the pixels in the image—and performs the same instruction, adding two to all of them simultaneously. And because the Cortex A53 was a processor designed for smartphones and tablets that handled a lot of media content, the HPSC can do this trick as well.
"This is particularly beneficial in tasks like image compression, processing, or stereo vision,” says Lovelly. "In applications that can’t utilize this feature, the HPSC performs slightly better than the GR740 and other top-performing space processors. But when it comes to things where it can be used, the chip gets well ahead of the competitors.”
Making space exploration sci-fi again
Chip designers in the US tend to go for more powerful, but more energy-hungry, space-grade processors because NASA aims to run more large-scale robotic and crewed missions compared to its European counterparts. In Europe, there are no current plans to send humans or car-sized planetary rovers to the Moon or Mars in the foreseeable future. The modern ESA is more focused on probes and satellites, which usually work on tight energy budgets, meaning something light, nimble, and extremely energy-efficient like the GR740 makes much more sense. The HPSC, in turn, has been designed from the ground up to make at least some of NASA’s at-times sci-fi ambitions reality.
Back in 2011, for instance, NASA’s Game Changing Development Program commissioned a study to determine what space computing needs would look like in the next 15 to 20 years. A team of experts from various NASA centers came up with a list of problems advanced processors could solve in both crewed and robotic missions. One of the first things they pointed to was advanced vehicle health management, which they deemed crucial for sending humans on long deep space missions. It boils down to having sensors constantly monitoring the health of crucial components. Fast processors are needed to get data from all those sensors at high frequencies. A sluggish computer could probably cope with this task if the sensor readouts came in every 10 minutes or so, but if you want to do the entire checkup multiple times a second to achieve something resembling real-time monitoring, the processor needs to be really fast. The goal is to have astronauts seated in front of consoles that show the actual condition of their spaceship with voiced alerts and advanced graphics. And running such advanced graphics would also demand fast computers. The team called that "improved displays and controls."
But the sci-fi aspirations do not end at flight consoles. Astronauts exploring alien worlds could likely have augmented reality features built right into their visors. The view of a physical environment around them will be enhanced with computer-generated video, sound, or GPS data. Augmentation would in theory provide situational awareness, highlighting areas worthy of exploring and warning against potentially dangerous situations. Of course, having the AR built into the helmets is only one possible option. Other notable ideas mentioned in the study included hand-held, smartphone-like devices and something vaguely specified as "other display capabilities" (whatever those other capabilities may be). Faster space-grade processors would be needed to power such computing advances.
Faster space-grade processors are meant to ultimately improve robotic missions as well. Extreme terrain landing is one of the primary examples. Choosing a landing site for a rover is a tradeoff between safety and scientific value. The safest possible site is a flat plain with no rocks, hills, valleys, or outcrops. The most scientifically interesting site, however, is geologically diverse, which usually means that it is packed with rocks, hills, valleys, and outcrops. So-called Terrain Relative Navigation (TRN) is one of the ways to deal with that. Rovers equipped with TRN could recognize important landmarks, see potential hazards, and navigate around them, narrowing down the landing radius to less than 100 meters. The problem is that current space-grade processors are way too slow to process images at such a rate. The NASA team behind the study ran a TRN software benchmark on the RAD750 and found the update from a single camera took roughly 10 seconds. Unfortunately, 10 seconds would be a lot when you’re falling down to the Martian surface. To land a rover within a 100-meter radius, an update from a camera would have to be processed every second. For a pinpoint, one-meter landing, estimates would need to come at 10Hz, which is 10 updates per second.
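The gap between the benchmark and those targets is easy to put into numbers. The sketch below is simple arithmetic on the figures above; it assumes that the time per update shrinks in direct proportion to processing speed, and the function name is made up for illustration.

```python
RAD750_SECONDS_PER_UPDATE = 10.0   # single-camera benchmark result quoted above

# Update rates the text says are needed, in updates per second (Hz).
REQUIRED_HZ = {
    "landing within a 100-meter radius": 1.0,
    "pinpoint one-meter landing": 10.0,
}


def required_speedup(required_hz, seconds_per_update):
    """How many times faster processing must get to reach the required rate."""
    return required_hz * seconds_per_update


for goal, hz in REQUIRED_HZ.items():
    factor = required_speedup(hz, RAD750_SECONDS_PER_UPDATE)
    print(f"{goal}: {factor:.0f}x faster than the RAD750 benchmark")
```

Under that assumption, the 100-meter case calls for a 10-fold speedup and the one-meter case for a 100-fold speedup.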
Other things on NASA’s computational wishlist include algorithms that can predict impending disasters based on sensor readouts, intelligent scheduling, advanced autonomy, and so on. All this is beyond the capabilities of current space-grade processors. So in the study, NASA engineers estimated how much processing power would be needed to efficiently run those things. They found that spacecraft health management and extreme terrain landing needed between 10 and 50 GOPS (gigaoperations per second). Futuristic sci-fi flight consoles with fancy displays and advanced graphics needed somewhere between 50 and 100 GOPS. The same goes for augmented reality helmets or other devices; these would also need between 50 and 100 GOPS.
Ideally, future space-grade processors would be able to power all those things smoothly. Today, the HPSC running at a power dissipation between 7 and 10 Watts can process 9 to 15 GOPS. This alone would make extreme landing possible, but the HPSC is designed in such a way that this figure can go up significantly. First, those 15 GOPS do not include performance benefits that the SIMD engine brings to the table. Second, the processor can work connected to other HPSCs and external devices like special-purpose processors, FPGAs, or GPUs. Thus, a future spaceship can potentially have multiple distributed processors working in parallel with specialized chips assigned to certain tasks like image or signal processing.
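To see what connecting several chips could mean in practice, the study's demand estimates can be set against the 9 to 15 GOPS quoted for one HPSC. This is a rough sketch under a strong assumption, namely that throughput adds up perfectly when chips work in parallel; it also leaves out any SIMD gains, and the function name is hypothetical.

```python
import math

HPSC_GOPS = (9, 15)   # low and high throughput quoted for one HPSC, without SIMD

# Estimated demand per application in GOPS, taken from the NASA study figures.
DEMAND_GOPS = {
    "vehicle health management": (10, 50),
    "extreme terrain landing": (10, 50),
    "improved displays and controls": (50, 100),
    "augmented reality": (50, 100),
}


def chips_needed(required_gops, gops_per_chip):
    """Chips required if throughput scaled perfectly linearly (an idealization)."""
    return math.ceil(required_gops / gops_per_chip)


for task, (low, high) in DEMAND_GOPS.items():
    best = chips_needed(low, HPSC_GOPS[1])     # lightest demand, fastest chip
    worst = chips_needed(high, HPSC_GOPS[0])   # heaviest demand, slowest chip
    print(f"{task}: {best} to {worst} HPSC-class chips")
```

With these inputs, the health management and landing workloads come out at 1 to 6 chips, and the display and augmented reality workloads at 4 to 12. Real systems would scale less cleanly, which is one reason offloading to SIMD units and specialized chips matters.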
No matter where humanity’s deep space dreams go next, we won’t have to wait that long for engineers to know where the current computing power stands. The LEON GR740 is scheduled for delivery to ESA at the end of this year, and after a few additional tests it should be flight ready in 2020. The HPSC, in turn, is set for a fabrication phase that should begin in 2021 and last until 2022. Testing is expected to take a few months in 2022.
NASA should get flight-ready HPSC chips by the end of 2022. That means, all other complicating timeline factors aside, at least the future of space silicon appears on track to be ready for spaceships taking humans back to the Moon in 2024.