Cray Research Inc claims that its forthcoming Alpha RISC-based massively parallel processing system due next year (CI No 2,037) will be the world’s first multi-purpose massively parallel system useful for real production work. As evidence of current deficiencies, company officials cited recent studies done at leading national laboratories, in which Cray’s Y-MP C90 parallel vector supercomputer systems outperformed current massively parallel machines even on highly parallel programs. The company says that in a recent NASA Ames Research Center study, the C90 system outperformed current parallel systems on a series of parallel problems, including one termed embarrassingly parallel. The 16-processor C90 system is consistently the highest performing system tested, far surpassing any of the highly parallel systems, the study concluded, adding that the performance rates on the highly parallel systems are typically only 2% to 5% of the theoretical peak performance of these systems, versus more than 50% in some cases for the C90 system. The results show that on these codes, only the Cray currently demonstrates GigaFLOPS performance, according to a September 1992 article in IEEE Spectrum. Similar results were claimed for a study at the Los Alamos National Laboratory, and both studies identified the same major performance bottlenecks of the current massively parallel systems: immature software compilers, inadequate memory bandwidth, and insufficient bandwidth between the individual processors. To help customers develop programs for Cray’s massively parallel system before it arrives next year, the company has developed an emulator that enables users to run massively parallel applications on their Cray Y-MP systems. The emulator is designed to help developers to write more efficient parallel code, by providing feedback on data layout, data locality, and data reference patterns. 
The high-speed interprocessor communications network will link the processing elements to distribute and access global data. It uses the same high-performance switch technology as the Cray Y-MP processor-memory interface and operates at the same 150MHz clock speed as the Digital Equipment Corp RISC nodes to provide extremely fast non-local access, Cray says. It adds that a key feature of the system will be a three-dimensional torus interconnect network that increases bandwidth and minimises network distances. The torus, it is claimed, will give Cray’s system the highest bisection bandwidth of any known massively parallel system by keeping the nodes close to each other, avoiding the far neighbour communication delays found in other systems. The machine, code-named T3D, uses high-performance switch nodes that handle interprocessor communications without interrupting the processing elements. Each switch node can operate bi-directionally in each dimension. It will also have globally addressable, physically distributed memory; because this is logically shared, any processing element can access the memory of any other processing element without explicit message passing and without involving the remote processing element, so that the system can be scaled to address terabytes of memory. To help sustain high performance, special communication hardware will enable data in remote processing elements to be moved to a local element before it is needed, providing a logical cache. The T3D systems will connect to the company’s input-output subsystems with multiple high-speed channels. Each processing element will have a microkernel that manages communications with other processing elements and with the closely coupled Cray Y-MP vector processors. The machines will be scalable from tens to hundreds to thousands of processing elements.
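To illustrate why a torus shortens network distances and raises bisection bandwidth, the following minimal Python sketch compares a three-dimensional torus with a plain mesh of the same size. The 8 x 8 x 8 layout, the function names and the assumption of one link per neighbour pair are hypothetical and chosen only for illustration; Cray has not stated the actual dimensions or link counts here.

```python
from itertools import product


def torus_hops(a, b, dims):
    """Minimum hops between nodes a and b when each dimension wraps around."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))


def mesh_hops(a, b):
    """Minimum hops between nodes a and b with no wraparound links."""
    return sum(abs(x - y) for x, y in zip(a, b))


def max_hops(dims, wraparound):
    """Longest shortest path, measured from the corner node (0, 0, ...)."""
    origin = (0,) * len(dims)
    nodes = product(*(range(d) for d in dims))
    if wraparound:
        return max(torus_hops(origin, n, dims) for n in nodes)
    return max(mesh_hops(origin, n) for n in nodes)


def bisection_links(dims, wraparound):
    """Links cut when the network is halved across its longest dimension.

    Assumes that dimension is even and larger than 2, with one link per
    neighbour pair; wraparound links double the number of links cut.
    """
    total = 1
    for d in dims:
        total *= d
    cross_section = total // max(dims)
    return cross_section * (2 if wraparound else 1)


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical 512-node layout, not a Cray figure
    print("torus: max hops", max_hops(dims, True),
          "bisection links", bisection_links(dims, True))
    print("mesh:  max hops", max_hops(dims, False),
          "bisection links", bisection_links(dims, False))
```

Under these assumptions the wraparound links cut the worst-case distance from 21 hops to 12 and double the number of links crossing the bisection, which is the kind of gain the torus claim rests on.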
Software-configurable redundant hardware will be included so that processing can continue, without hardware maintenance, should a processing element fail. The markets Cray is pitching for include seismic data processing for petroleum exploration; atmospheric modelling for weather prediction and climate research; computational fluid dynamics and structural analysis for the aerospace and automotive industries; computational chemistry for drug design and materials science applications; and computational electromagnetics. Cray’s plans call for delivering the first-phase T3D system in 1993, doing 150GFLOPS peak in a 1024-processor configuration, scalable to 300GFLOPS peak in a 2048-processor version at something in the $25m to $30m range; the second-phase system in mid-decade, doing 1 TFLOPS peak; and a third-phase system in 1997 for sustained TFLOPS performance.
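As a rough check on the first-phase figures, the quoted peak ratings can be divided by the processor counts and compared with the 150MHz clock. The sketch below is only arithmetic on the numbers given above; the helper name is hypothetical, and the quoted peaks are presumably rounded.

```python
CLOCK_MHZ = 150.0  # clock speed quoted for the RISC nodes


def per_processor_mflops(peak_gflops, processors):
    """Peak MFLOPS per processing element, taking 1 GFLOPS = 1000 MFLOPS."""
    return peak_gflops * 1000.0 / processors


# Processor count -> quoted peak GFLOPS for the first-phase system.
configs = {1024: 150.0, 2048: 300.0}

for processors, peak in configs.items():
    mflops = per_processor_mflops(peak, processors)
    print(f"{processors} processors: {mflops:.1f} MFLOPS each, "
          f"{mflops / CLOCK_MHZ:.2f} floating-point operations per clock")
```

Both configurations work out to about 146.5 MFLOPS per processing element, just under one floating-point operation per clock cycle at 150MHz.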