HDFS (Hadoop Distributed File System)
The Hadoop Distributed File System offers a basic framework for splitting up data collections between multiple nodes while using replication to recover from node failure. The large files are broken into blocks, and several nodes may hold all of the blocks from a file. The image from the Apache documentation at left shows how blocks are distributed across multiple nodes.
The file system is designed to mix fault tolerance with high throughput. The blocks are loaded to maintain steady streaming and are not usually cached to minimize latency. The default model imagines long jobs processing lots of locally stored data. This fits the project's motto that “moving computation is cheaper than moving data.”
The HDFS is also distributed under the Apache license from http://hadoop.apache.org/.