Hopkins to build data analysis super machine
- 04 November, 2010 01:09
Disregarding the supercomputing community's insatiable thirst for FLOPS (floating point operations per second), the Baltimore-based Johns Hopkins University is configuring its new machine to achieve the maximum number of IOPS (I/O operations per second) instead.
The novel design will be better suited to the kind of data-mining-oriented scientific workloads processed by today's supercomputers, argued Alexander Szalay, a computer scientist and astrophysicist at Johns Hopkins' Institute for Data Intensive Engineering and Science, who is leading the project.
"For the sciences, it is the I/O that is becoming the major bottleneck," he explained. "People are running larger and larger simulations, and they take up so much memory, it is difficult to write the output to disk."
The U.S. National Science Foundation (NSF) has provided US$2.1 million for the system, called Data-Scope. Hopkins itself is contributing $1 million as well.
Thus far, 20 research groups within Hopkins have indicated they could use the system to study problems in genomics, ocean circulation, turbulence, astrophysics and environmental science. The university will also allow outside organizations to use the machine. Data-Scope is expected to go live by next May.
FLOPS measures the number of floating point calculations a computer can perform in a second, an essential capability for analyzing large amounts of data. IOPS, by contrast, measures the number of input/output operations a computer can perform in a second, which reflects how quickly data can be moved on and off the machine.
By maximizing IOPS, the new system will "enable data analysis tasks that are simply not possible today," the researchers stated in the proposal.
Today, most researchers are limited to analyzing datasets of up to 10 terabytes in size, while larger datasets, such as those of 100 terabytes or more, can only be investigated by a handful of the largest supercomputers. Hopkins' novel configuration of hardware might offer a lower-cost way to analyze such big datasets, Szalay said.
The machine, once built, will have a total I/O bandwidth of 400 to 500 gigabytes per second, more than twice that of Oak Ridge National Laboratory's Jaguar, the fastest computer on the Top 500 ranking of the world's most powerful computers.
Data-Scope, however, will only offer a peak performance of about 600 teraflops, far short of Jaguar's 1.75 petaflops.
In Hopkins' design, each server will have 24 dedicated hard disk drives as well as four solid state disks. Together, these can deliver 4.4 gigabytes per second across the chassis bus directly to two GPUs (graphics processing units), which will do much of the calculation.
Overall, the system will have about 100 of these machines and about five petabytes in storage total.
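The per-server and system-wide figures quoted above can be cross-checked with some simple arithmetic. The following is a minimal sketch in Python, using only the numbers given in this article; the function names are made up for illustration, and the scan-time estimate assumes, optimistically, that the full aggregate bandwidth can be sustained for an entire pass over the data.

```python
# Figures as quoted in the article (approximate).
SERVERS = 100                 # "about 100 of these machines"
PER_SERVER_GB_PER_SEC = 4.4   # disk-to-GPU throughput per server
TOTAL_STORAGE_PB = 5          # "about five petabytes in storage total"


def aggregate_bandwidth_gb_per_sec(servers, per_server_gb_per_sec):
    """Total I/O bandwidth if every server streams at its full rate."""
    return servers * per_server_gb_per_sec


def storage_per_server_tb(total_pb, servers):
    """Average storage per server, using 1 PB = 1,000 TB."""
    return total_pb * 1000 / servers


def scan_time_seconds(dataset_tb, bandwidth_gb_per_sec):
    """Idealized time to read a dataset once, using 1 TB = 1,000 GB."""
    return dataset_tb * 1000 / bandwidth_gb_per_sec


total_bw = aggregate_bandwidth_gb_per_sec(SERVERS, PER_SERVER_GB_PER_SEC)
per_server_tb = storage_per_server_tb(TOTAL_STORAGE_PB, SERVERS)
scan_100tb = scan_time_seconds(100, total_bw)

print(f"Aggregate bandwidth: {total_bw:.0f} GB/s")
print(f"Storage per server:  {per_server_tb:.0f} TB")
print(f"One pass over 100 TB: {scan_100tb / 60:.1f} minutes")
```

The aggregate works out to roughly 440 gigabytes per second, which falls inside the 400 to 500 gigabytes per second range the designers cite.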
To guide the design, the team used a rule of thumb devised by computer scientist Gene Amdahl. Ideally, Amdahl posited, a computer should have one I/O bit ready for each instruction it executes.
Most supercomputer architects have disregarded this rule, claiming the processor caches can bank data and have it ready for use when needed. Now that datasets have grown so large, Amdahl's rule should be reconsidered, Szalay argued.
A typical supercomputer would have an Amdahl number of about .001, or a thousandth of the optimal balance, whereas Data-Scope should have an Amdahl number of about .6 or .7.
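Taking Amdahl's rule as stated here, one bit of I/O per instruction, the Amdahl number can be read as the ratio of I/O bits per second to instructions per second. The sketch below shows that ratio in Python. The function name and the two example machines are hypothetical, chosen only to produce a balanced value and a compute-heavy one. The article does not say which instruction rate the designers used to arrive at .6 or .7, so the sketch does not attempt to reproduce that figure.

```python
def amdahl_number(io_bytes_per_sec, instructions_per_sec):
    """Ratio of I/O bits per second to instructions per second.

    A value of 1.0 means one bit of I/O is available for every
    instruction executed, the balance Amdahl's rule of thumb calls for.
    """
    return io_bytes_per_sec * 8 / instructions_per_sec


# Hypothetical machine A: 1 GB/s of I/O, 8 billion instructions/s.
balanced = amdahl_number(1e9, 8e9)

# Hypothetical machine B: the same I/O, but 1,000 times the compute.
compute_heavy = amdahl_number(1e9, 8e12)

print(f"Machine A: {balanced:g}")
print(f"Machine B: {compute_heavy:g}")
```

Machine B illustrates how adding processing power without adding I/O pushes the number toward the .001 typical of supercomputers.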
The designers also plan to make some changes in the way databases are used. "We don't use the database just as dump storage but as an active computing environment," Szalay said. Instead of moving data from a database across a network to a cluster of servers, researchers can write user-defined functions that can run against the database itself.
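As a rough illustration of running a user-defined function against the database rather than exporting the data first, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in, since the article does not describe how Data-Scope's databases expose such functions. The table, its columns and the `magnitude` function are all hypothetical.

```python
import math
import sqlite3


def magnitude(x, y, z):
    """Length of a 3-D vector; registered below as an SQL function."""
    return math.sqrt(x * x + y * y + z * z)


conn = sqlite3.connect(":memory:")

# Register the Python function so SQL queries can call it by name.
conn.create_function("magnitude", 3, magnitude)

# Hypothetical table of velocity samples, e.g. from a turbulence run.
conn.execute(
    "CREATE TABLE velocity (id INTEGER PRIMARY KEY, vx REAL, vy REAL, vz REAL)"
)
conn.executemany(
    "INSERT INTO velocity (vx, vy, vz) VALUES (?, ?, ?)",
    [(3.0, 4.0, 0.0), (1.0, 2.0, 2.0), (0.0, 0.0, 0.5)],
)

# The filter runs inside the query, so only matching rows come back.
fast_rows = conn.execute(
    "SELECT id, magnitude(vx, vy, vz) FROM velocity "
    "WHERE magnitude(vx, vy, vz) > 2.0 ORDER BY id"
).fetchall()

print(fast_rows)
conn.close()
```

Because the function is evaluated as part of the query, only the rows that pass the filter are returned to the caller. On a system with a separate database server, that is what would keep the bulk of the data off the network.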
Researchers can use one of three images that can be booted on the system: Windows Server 2008, a combination of Linux and MySQL and a third instance running Hadoop.
Data-Scope will be housed in a new campus green data center being built with $1.3 million in funding from the NSF.