By Timothy Prickett Morgan
IBM and the University of New Mexico have forged a joint research project agreement that will culminate in a port to the open source Linux operating system of the parallel supercomputing messaging software that turns IBM’s AIX into a parallel environment capable of supporting thousands of processors. The project, known as Vista Azul, will also stress test RS/6000 SP and Netfinity Linux clusters coupled together in a hypercluster to share workloads, and will provide a real-world environment where researchers can compare and contrast these very different kinds of supercomputing iron.
“We think that hybrid servers are the wave of the future, and that there are significant issues with regard to systems management that need to be addressed,” said Dave Turek, director of technical strategies and business opportunities at IBM’s Server Group. Turek says that IBM wants to examine the stability of these two architectures (SPs are high-bandwidth, specialized servers, while a Linux cluster will use off-the-shelf commodity Intel motherboards with relatively low bandwidth) and find out what the real-world mean time between failures is for each machine.
The idea is to accurately characterize their strengths and weaknesses, which will not only help universities and commercial institutions conduct more research on supercomputers, but will also help IBM pitch the right architecture for the right job. “This is not a project to say that SPs are better than Linux clusters, but a means to show what applications are appropriate to these kinds of machines,” said Turek.
Right now, the University of New Mexico has a 64-node Linux cluster called Road Runner that uses Red Hat Linux and a homegrown implementation of clustering extensions that is functionally equivalent to, but distinct from, the Beowulf clustering software available through the open source community. The Linux cluster was built by a local white box vendor and researchers at the university. As part of the Vista Azul project, IBM will drop in a modest eight-node SP server, with 32 processors in all, built from its forthcoming Winterhawk-II SP nodes, which are due in the first quarter of 2000. The Winterhawk-IIs use IBM’s 375MHz Power3-II copper-based processors and offer 6 gigaflops of number crunching power per node, for a total of 48 gigaflops. These SP nodes will be connected using IBM’s existing SP switch, which has dual 150MB/sec channels per node and, with eight nodes, a total of 2.4GB/sec of interconnect bandwidth.
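The SP totals quoted above are simple products, and a short sketch makes the arithmetic explicit. The function and constant names below are illustrative only, and the calculation is unit-agnostic: it multiplies the per-node figures as the article states them.

```python
# Figures quoted in the article for the initial SP configuration.
SP_NODES = 8            # nodes in the initial SP server
GFLOPS_PER_NODE = 6     # peak gigaflops per Winterhawk-II node, as quoted
CHANNELS_PER_NODE = 2   # "dual" switch channels per node
CHANNEL_RATE = 150      # rate of one channel, in the article's units per second


def sp_peak_gflops(nodes=SP_NODES, per_node=GFLOPS_PER_NODE):
    """Aggregate peak gigaflops: nodes times per-node peak."""
    return nodes * per_node


def sp_aggregate_bandwidth(nodes=SP_NODES,
                           channels=CHANNELS_PER_NODE,
                           rate=CHANNEL_RATE):
    """Aggregate switch bandwidth: nodes times channels times channel rate."""
    return nodes * channels * rate


if __name__ == "__main__":
    print(sp_peak_gflops())          # 48
    print(sp_aggregate_bandwidth())  # 2400, i.e. 2.4 thousand units per second
```

Both figures are theoretical peaks; neither says anything about what an application will sustain.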
Over time, that SP system will grow, but exactly how many nodes it gets depends on what applications are put on the box. Odds are, it will grow to hundreds of gigaflops over the next two years. The existing Road Runner cluster will be networked with this SP server, and IBM will also install a 512-node Netfinity cluster that will share work with these two machines. Right now, says IBM, the plan is to use four-way Intel motherboards because Linux still has problems scaling beyond that number of processors. The exact clock speeds have not been set for the Intel chips, but it is reasonable to assume that Pentium III Xeons running at 600MHz with 1MB of L2 cache memory are the slowest chips that IBM and the university would contemplate. While IBM did not have the gigaflops ratings of the Intel-based machines, the Pentium II and Pentium III processors can do one floating point operation per clock cycle, yielding a peak theoretical rating of about 1.2 teraflops for a 2,048-processor Linux cluster and probably about 900 gigaflops where the Fortran compiler meets the road. Obviously, the Linux part of the hypercluster will have lots of power, but it will have much less bandwidth than the SP that sits beside it.
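The 1.2 teraflops estimate for the Linux cluster can be reproduced in a few lines. This is a rough sketch that assumes one floating point operation per clock cycle per processor, which is the reading that makes the article's figure work out. The 600MHz clock is the article's own guess rather than a confirmed specification, and the function names are illustrative.

```python
# Assumed configuration of the planned Linux cluster, per the article.
NODES = 512             # Netfinity nodes
CPUS_PER_NODE = 4       # four-way Intel motherboards
CLOCK_MHZ = 600         # assumed slowest Pentium III Xeon clock
FLOPS_PER_CYCLE = 1     # assumption: one floating point operation per cycle


def peak_gflops(nodes=NODES, cpus=CPUS_PER_NODE,
                clock_mhz=CLOCK_MHZ, flops_per_cycle=FLOPS_PER_CYCLE):
    """Theoretical peak in gigaflops (megaflops divided by 1,000)."""
    return nodes * cpus * clock_mhz * flops_per_cycle / 1000


def implied_efficiency(sustained_gflops, peak):
    """Fraction of peak that a sustained estimate represents."""
    return sustained_gflops / peak


if __name__ == "__main__":
    peak = peak_gflops()
    print(NODES * CPUS_PER_NODE)                     # 2048 processors
    print(peak)                                      # about 1228.8 gigaflops
    print(round(implied_efficiency(900, peak), 2))   # about 0.73
```

On these assumptions, the 900 gigaflops sustained estimate corresponds to roughly 73 percent of theoretical peak.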
The really interesting part of the Vista Azul project is not the hardware, but the software. Rather than extending the AIX operating system so it can control the Linux hypercluster, IBM is instead going to work with University of New Mexico researchers to take the Parallel System Support Programs (PSSP) kernel extensions to AIX 4.3.3 and port them to the Linux kernel. “We think building extensions to Linux is the best way to implement this in the shortest amount of time,” says Turek. He says that IBM could eventually offer the Linux extensions as an open source product, but that would depend on a lot of factors, not the least of which would be the availability of open source programmers to work on the project. If IBM has to foot the bill for the programming, it seems, it will want to recoup its investment by charging for the Vista Azul hypercluster code. Moreover, IBM will probably also be working on porting system management code from AIX to Linux, and this code could also end up being either an open source or a commercial product.