Broken Space Toilet: Piss-Poor Redundant System Design on the International Space Station
The Space Shuttle Discovery has just blasted off for the International Space Station with the billion-dollar Kibo Japanese science lab inside. Perhaps even more important, the shuttle also carries emergency repair parts for the International Space Station’s solitary toilet, which has partially failed. It can no longer collect “number 1” reliably although it’s still OK with “number 2.” Now it seems to me that when you send people into space, there are some things that absolutely, positively cannot fail. Those are called “life-support systems” and include things like oxygen generators, CO2 air scrubbers, water recyclers and purifiers, and toilets. People need to take things in and they need to get rid of some things as well. Otherwise, they get ill and die.
So it came as a big surprise to me to find out that there’s only one toilet on the International Space Station at the moment and that’s been the case for the last seven years. Apparently, it’s broken before. The astronauts on the International Space Station can use the port-a-potty on the always-attached Soyuz capsule and they can use other stopgap measures, but it appears that this particular critical life-support function has no redundant backup. I cannot classify this as anything but piss-poor system design. (Sorry, I absolutely could not resist that one.)
When you’re in space, you cannot just run to the nearest Home Depot in LEO (low Earth orbit) for replacement toilet parts like a flapper or a ball-cock valve. And I’ve traveled enough, and seen (and fixed) enough broken, leaking hotel toilets, to know that toilets are pretty unreliable things, although I have no familiarity with $19 million toilets like the one installed on the International Space Station.
We in the SOC design business are now facing a similar sort of systems engineering problem. At today’s smallest lithographies, the chances of building perfect chips are getting slimmer and slimmer. At the same time, the chances of these same parts failing in the field are increasing. Yet we still create SOC system designs that assume perfection. We assume that imperfect parts will be culled through testing and that a certain level of field failures is acceptable. This is part of an underlying design mentality established over decades of experience with solid-state design. However, I’m here to say that this experience no longer applies. (Note that this design mentality did not always prevail. I’m old enough to remember tube-based televisions being repaired by the friendly TV repairman who came to your home, pulled off the back of your TV, replaced some tubes, and got Howdy Doody back on the screen for you.)
Some of us already know how to design adaptive, highly reliable, fault-tolerant systems. Generally, those designs go into military and aerospace hardware. Most of us do not bother. I submit that the old styles of system design, ones that cannot tolerate failure, are headed for the great scrap heap of outdated design methodologies. For many years, memory designers have put redundant rows and columns in their memory designs to boost yield. Logic designers must now do this as well. Cisco, when it designed its 188-processor SPP (a network-routing chip), added four more redundant processors to boost chip yield. That design style is currently the exception. It will rapidly become the rule.
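To see why a handful of spare processors can matter so much, here is a minimal back-of-the-envelope sketch. It assumes each processor on the die works independently with the same probability, which is a simplification, and the per-processor yield figure is made up for illustration; it is not a Cisco number. The function name yield_with_spares is likewise just a label for this example.

```python
from math import comb


def yield_with_spares(needed: int, spares: int, p_good: float) -> float:
    """Probability that at least `needed` of `needed + spares` processors work.

    Assumes independent, identically likely defects (a simple binomial model).
    """
    total = needed + spares
    return sum(
        comb(total, good) * p_good**good * (1 - p_good) ** (total - good)
        for good in range(needed, total + 1)
    )


if __name__ == "__main__":
    p = 0.995  # hypothetical per-processor yield, chosen only for illustration
    print(f"188 processors, no spares:   {yield_with_spares(188, 0, p):.3f}")
    print(f"188 processors, four spares: {yield_with_spares(188, 4, p):.3f}")
```

With no spares, every one of the 188 processors has to be good, so the chip yield is the per-processor yield raised to the 188th power. With four spares, the chip survives up to four bad processors, and under this simple model the yield climbs sharply.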
We can and must make more intelligent use of the available on-chip transistors. Not every service call requires a shuttle launch, but every service call is costly nevertheless.