Computers, as we all know, are notoriously unforgiving when it comes to defects in their electronic components. Forget for the moment that most of the errors and crashes we see every day are the result of bad programming, more often than not due to pieces of software arguing over a hardware resource and locking up the system. If it were not for the testing procedures of semiconductor manufacturers, software would never get a chance to run at all, much less crash occasionally. Hardware defects in chips, particularly in central processors, have held up many a past announcement. Rooting out defective parts from the piles of chips that roll out of the fab every day is what makes microprocessors so expensive, too. That’s why the bright boys at Hewlett-Packard Co’s HP Labs have invented Teramac, which is arguably the world’s first truly defect tolerant computer. Don’t get confused: defect tolerance is not the same thing as fault tolerance. With fault tolerant computers, the hardware or software of a machine in a network of similar machines can up and die without taking the whole network down; work on the network continues, although at a proportionally slower pace. (Tandem NonStops and Himalayas are the quintessential fault tolerant machines.) With defect tolerance, a computer is constructed in such a way that its processors and their connections to one another, to memory resources and to the outside world can continue to function even if a large number of those chips and interconnects are shorted out or otherwise defective. The idea is to let software figure out which parts are defective and simply avoid them. As parts become defective – say, in anger, you kick the side of the Teramac server, ripping out wires and processors in a fury, effectively simulating a stroke – the same defect analysis program can be run, the defect database updated and processing can continue, although, again, at a somewhat slower pace.
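To make the idea concrete, here is a minimal Python sketch of that defect-avoidance loop: test every resource, record the failures in a defect database, and configure the machine using only what is left. The function names and toy resource numbering are purely illustrative stand-ins, not HP’s actual Teramac software.

    # Sketch of the defect-tolerance idea: analyse once, record the bad parts,
    # then map work only onto resources that are not in the defect database.

    def run_defect_analysis(resources, test):
        """Test every resource once and return the set of defective ones."""
        return {r for r in resources if not test(r)}

    def configure(resources, defect_db):
        """Use only resources absent from the defect database."""
        return [r for r in resources if r not in defect_db]

    # Resources 0..9, with 3 and 7 standing in for shorted wires or dead gates.
    resources = range(10)
    defect_db = run_defect_analysis(resources, test=lambda r: r not in (3, 7))
    usable = configure(resources, defect_db)
    # After a new failure (the kick to the side of the server), re-run the
    # analysis, update the database and keep going with whatever still works.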

Human brain

Researchers at HP Labs published the results of their Teramac tests in the June issue of Science. The Teramac consisted of 864 identical general purpose field programmable gate array (FPGA) chips, which contain very simple circuits that can be programmed to behave like various logic elements (they are basically transistors emulated in memory). By using FPGAs rather than the normal etched blocks of solid state logic found in conventional chips, the Teramac is an extremely configurable machine. In one configuration, Teramac was set up to translate magnetic resonance data into a 3D picture of the human brain. In another, it was reconfigured (by loading a giant instruction to set all the switches in the FPGAs) to become a more generic visualization engine. The Teramac FPGAs, which were basically pulled out of the dumpster of a Silicon Valley chip manufacturer, were interlinked into a massively parallel supercomputer. This interlinking was far more intricate than the symmetric multiprocessing (SMP) clustering used in modern servers (and far too complex to describe in less than three paragraphs, so you’ll have to read Science yourself). The chips ran at a measly 1 megahertz but, because Teramac uses very long instruction word (VLIW) programming techniques employing 300 megabit instructions, the resulting server provided 100 times the power of a high-speed Unix workstation. It did so even though the Teramac computer had 220,000 known defects in its chips and interconnects, any one of which would have been fatal to a traditional computer. As it was, only about 3 percent of Teramac’s 7.7 million resources (arrays and interconnects) were defective, but HP believes that as much as 50% of those resources could have been faulty and Teramac would still have worked.
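For what it’s worth, the quoted figures hang together; a quick back-of-the-envelope check in Python, using only the numbers cited above from the Science paper:

    # 220,000 known defects out of roughly 7.7 million resources.
    known_defects = 220_000
    total_resources = 7_700_000
    defect_fraction = known_defects / total_resources
    print(f"{defect_fraction:.1%}")   # ~2.9%, i.e. the "about 3 percent" quoted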

Moore’s Law

The Teramac puts to the test a lot of ideas, all of which will be useful as we move into molecular and nanotechnology computers in the next millennium. As silicon and copper wires shrink down to the 0.10 micron range in the coming years, it will get harder and more expensive to build chips without defects – costs for silicon fabs capable of producing perfect chips are increasing exponentially over time; Sematech expects the cost of a CMOS fab to be $30bn or more by 2012. We may be able to play Moore’s Law until then, but the question is, can we afford it? The Teramac server and its flexible and powerful design call into question the whole way Intel, IBM, Motorola, Compaq and others go about designing chips. It may make sense, from an economic point of view, to explore this idea of defect tolerance at the system and chip level. Prices for Teramac-style servers would not depend on progressively smaller and more expensive CMOS processes and the resulting chip clock speeds they enable, but rather on the number of defects in the box, since computing power in the Teramac scales linearly with the number of devices working properly, and, fortunately, so do the compile times of programs. Perhaps more significantly, Teramac points the way to how a molecular or nanotechnology computer is probably going to have to be built. There is no way to make defect-free molecules or nanodevices, and even if there were, it would certainly not be economical. There will not be time enough in the universe to check for defects in the trillions upon trillions of molecules or nanomachines that will be used in even a modestly powerful nanocomputer. As a result, most computer researchers believe that some way has to be found to build nanocomputers that work despite their defects, and it looks like Teramac is a very good starting point.
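A rough Python sketch of that scaling argument, assuming strictly linear scaling and an arbitrary per-resource throughput figure (not a measured Teramac number): performance simply tracks the count of working resources, so a heavily defective part is a proportionally slower part rather than a reject.

    def effective_performance(total_resources, defective, per_resource_throughput=1.0):
        """Linear scaling: only the working resources contribute."""
        working = total_resources - defective
        return working * per_resource_throughput

    full = effective_performance(7_700_000, 0)
    half_broken = effective_performance(7_700_000, 3_850_000)
    assert half_broken == 0.5 * full   # 50% defective -> half the throughput, still a working machine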
